1. Introduction

Zillow’s housing market predictions, known as Zestimates, are valued for their nationwide coverage and general accuracy. For example, the nationwide median error for off-market homes is 7.5% and for on-market homes is 1.9%1. However, when considering a specific city or region, the accuracy of the Zestimates could be improved by including locally-specific data in the prediction process. This analysis aims to build an accurate and generalizable hedonic model that predicts home prices for Miami by deconstructing overall home price into the value of constituent parts. Accurate models lead to only small differences between the predicted and observed values, and generalizable models accurately predict on new data and with comparable accuracy across various groups. By working to improve the accuracy and generalizability of the predictive model, we are ultimately striving to create a more useful decision-making tool. Such a model may be useful for local governments as they assess property taxes, for example.

Modeling Strategy

To predict housing prices in the Miami area, we train a model using home prices from recent sales. Our model includes “features,” which are variables that are used to predict outcomes (in this case, home price). We use a hedonic model, which breaks home price into components that explain the cost: physical characteristics (e.g., living area), public services and amenities (e.g., distance to parks), and the spatial process of prices (e.g., prices for homes within neighborhoods cluster together). Our process uses data wrangling, exploratory analysis, feature engineering, feature selection, and model estimate and validation to produce a hedonic model that effectively predicts home prices for Miami and Miami Beach. Key challenges inherent to this process involve identifying and cleaning publicly available data for inclusion, investigating underlying spatial processes and trends, and selecting an effective set of features for inclusion in the model while avoiding choosing variables that are too closely correlated with each other.

Results

Our final model for predicting home prices includes the 20 independent variables shown in the Table below. Overall, the indicators used to evaluate the fit of the model (e.g., MAE) show that our model is not predicting well and is therefore not effective for use. The model was able to predict 85% of the variation in price, and most important features were internal property features, including lot size, actual square footage, whether the home is zoned for single family, and whether the home had an elevator or a dock. Generally, we think that our model’s inability to predict expensive homes is the largest source of the error. This poor performance skews our overall MAE. According to the mapped results, we were unable to effectively account for the spatial variation in prices. The model predicted poorly in Miami Beach and along the mainland coast, which is where many of the expensive houses are located. We would recommend that Zillow make further improvements before employing the model for official use. The model has large errors and does not do a good job of predicting home prices. In order to improve this model, we suggest further exploring features that would better account for the spatial processes at play.

finalvars<- c('Lot Size', 'Number of Bedrooms', 'Number of Bathrooms', 'Home Age', 'Home Size', 'Presence of a Pool', 'Whether the Property is a Single Family Home', 'Presence of Luxury Features', 'Presence of an Elevator', 'Presence of a Dock', 'Estate Zoning', 'Number of Sexual Offenders or Predators within an Eighth Mile of the Home', 'Distance to the Nearest Park', 'Within 0.5 miles of a Metro Stop', 'Median Household Income', 'Percent of Population that is Hispanic', 'Percent of Population with a Bachelors Degree', 'Percent of Households that are Owned', 'Percent of Population that Commutes via Car', 'Percent of Households with a Household Income of $200,000 or more')

vartype<-c('Internal Characteristic', 'Internal Characteristic','Internal Characteristic','Internal Characteristic','Internal Characteristic','Internal Characteristic','Internal Characteristic','Internal Characteristic','Internal Characteristic','Internal Characteristic','Internal Characteristic', 'Amenities/Public Services', 'Amenities/Public Services', 'Amenities/Public Services', 'Spatial Structure', 'Spatial Structure', 'Spatial Structure', 'Spatial Structure', 'Spatial Structure', 'Spatial Structure')

datasource<-c('Home Data', 'Home Data', 'Home Data', 'Home Data', 'Home Data', 'Home Data', 'Home Data', 'Home Data', 'Home Data', 'Home Data', 'Home Data', 'Miami Dade County Open Data', 'OpenStreetMap', 'Miami Dade County Open Data', 'U.S. Census Bureau American Community Survey', 'U.S. Census Bureau American Community Survey', 'U.S. Census Bureau American Community Survey', 'U.S. Census Bureau American Community Survey', 'U.S. Census Bureau American Community Survey', 'U.S. Census Bureau American Community Survey')

varsdesc<- c(
  'Total Lot Size in Square Feet',
  'Total Number of Bedrooms in the Home',
  'Total Number of Bathrooms in the Home',
  'Age of the Home, as of 2020',
  'Total Size of the Home in Square Feet',
  'Whether or Not the Property Includes a Pool',
  'Whether the Property is a Single-Family Home or a Multi-Family Home',
  'Whether the Home Contains any Luxury Features',
  'Whether the Home Contains an Elevator',
  'Whether the Home Contains a Dock',
  'Whether the Home is Zoned as an Estate',
  'The Number of Sexual Offenders and Predators within an Eighth of a Mile of the Home',
  'Distance (in feet) to the nearest Park',
  'Whether the Home is Within a Half-Mile of a MetroMover or MetroRail Stop',
  'Median Household Income in the Census Tract',
  'Percent of Population in the Census Tract that is Hispanic',
  'Percent of Population in the Census Tract with a Bachelors Degree',
  'Percent of Households in the Census Tract that are Owned',
  'Percent of Population in the Census Tract that Commutes via Car',
  'Percent of Households in the Census Tract with a Household Income of $200,000 or more')

varstable<- data.frame(finalvars, datasource, varsdesc)

varstable%>%
  rename(Final_Variables = finalvars,
         #Variable_Type = vartype,
         Data_Source = datasource,
         Variable_Description = varsdesc)%>%
  kable(caption="Final Independent Variables Included in Home Price Prediction Model")%>%
  pack_rows("Internal Characteristic", 1, 11, label_row_css = "background-color: #25CB10") %>%
  pack_rows("Amenities/Public Services", 12,14, label_row_css = "background-color: #faf500")%>%
  pack_rows("Spatial Structure", 15, 20, label_row_css = "background-color: #FA7800")%>%
  kable_styling("striped", full_width = F)%>%
  scroll_box(height = "250px")
Final Independent Variables Included in Home Price Prediction Model
Final_Variables Data_Source Variable_Description
Internal Characteristic
Lot Size Home Data Total Lot Size in Square Feet
Number of Bedrooms Home Data Total Number of Bedrooms in the Home
Number of Bathrooms Home Data Total Number of Bathrooms in the Home
Home Age Home Data Age of the Home, as of 2020
Home Size Home Data Total Size of the Home in Square Feet
Presence of a Pool Home Data Whether or Not the Property Includes a Pool
Whether the Property is a Single Family Home Home Data Whether the Property is a Single-Family Home or a Multi-Family Home
Presence of Luxury Features Home Data Whether the Home Contains any Luxury Features
Presence of an Elevator Home Data Whether the Home Contains an Elevator
Presence of a Dock Home Data Whether the Home Contains a Dock
Estate Zoning Home Data Whether the Home is Zoned as an Estate
Amenities/Public Services
Number of Sexual Offenders or Predators within an Eighth Mile of the Home Miami Dade County Open Data The Number of Sexual Offenders and Predators within an Eighth of a Mile of the Home
Distance to the Nearest Park OpenStreetMap Distance (in feet) to the nearest Park
Within 0.5 miles of a Metro Stop Miami Dade County Open Data Whether the Home is Within a Half-Mile of a MetroMover or MetroRail Stop
Spatial Structure
Median Household Income U.S. Census Bureau American Community Survey Median Household Income in the Census Tract
Percent of Population that is Hispanic U.S. Census Bureau American Community Survey Percent of Population in the Census Tract that is Hispanic
Percent of Population with a Bachelors Degree U.S. Census Bureau American Community Survey Percent of Population in the Census Tract with a Bachelors Degree
Percent of Households that are Owned U.S. Census Bureau American Community Survey Percent of Households in the Census Tract that are Owned
Percent of Population that Commutes via Car U.S. Census Bureau American Community Survey Percent of Population in the Census Tract that Commutes via Car
Percent of Households with a Household Income of $200,000 or more U.S. Census Bureau American Community Survey Percent of Households in the Census Tract with a Household Income of $200,000 or more

Setup Code

To begin the analysis, this section loads libraries and options, specifies styling options for maps and plots, creates quantile break functions and quantile styling, and loads color palettes. This section also creates the “nearest neighbor” function, which calculates the average nearest neighbor distance from each home to its k nearest objects of interest. This calculation is useful for creating features that describe the relative amount of an amenity around each home. For example, the nearest neighbor function can tell us the average distance from each home to the closest 1, 2, 3, 4, or 5 parks, bars, and crime events. In our exploratory analysis, we determine which nearest neighbor features are most appropriate for inclusion in the model.

Loading Libraries and Options

knitr::opts_chunk$set(
    echo = TRUE)

library(tidyverse)
library(sf)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot)
library(jtools)
library(mapview)
library(ggstance)
library(broom.mixed)
library(osmdata)
library(geosphere)
library(tidycensus)
library(stargazer)
library(kableExtra)
library(kableExtra)
library(stargazer)

options(tigris_class = "sf")

Loading Themes for Map and Plots

mapTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 13,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2)
  )
}

plotTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2),
    strip.background = element_rect(fill = "lightskyblue1", color = "white"),
    strip.text = element_text(size=12),
    axis.title = element_text(size=12),
    axis.text = element_text(size=10),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic"),
    legend.text = element_text(colour = "black", face = "italic"),
    strip.text.x = element_text(size = 14)
  )
}

palette5 <- c("#25CB10", "#5AB60C", "#8FA108",   "#C48C04", "#FA7800")

qBr <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]],0),
                          c(.01,.2,.4,.6,.8), na.rm=T))
  } else if (rnd == FALSE | rnd == F) {
    as.character(formatC(quantile(df[[variable]]), digits = 3),
                 c(.01,.2,.4,.6,.8), na.rm=T)
  }
}

q5 <- function(variable) {as.factor(ntile(variable, 5))}

Specifying Nearest Neighbor Function

nn_function <- function(measureFrom,measureTo,k) {
  measureFrom_Matrix <- as.matrix(measureFrom)
  measureTo_Matrix <- as.matrix(measureTo)
  nn <-   
    get.knnx(measureTo, measureFrom, k)$nn.dist
  output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>% 
    dplyr::select(-thisPoint) %>%
    pull()
  
  return(output)  
}

2. Data

This analysis uses data from a variety of sources: we derived physical characteristics of each house from the underlying home data, gathered public services and amenities information from the Miami-Dade County Open Data Hub and OpenStreetMap, and incorporated population data from the U.S. Census Bureau’s 5-year American Community Survey.

Data Gathering: Description & Code

We accessed data for the analysis via the Miami-Dade County Open Data Hub API and using the osmdata package to access OpenStreetMap variables of interest. Most of the data required manipulation and cleaning before being explored. This process included transformations in order to maximize the effectiveness of prediction for each variable. The code below presents the steps in the data gathering process.

text_tabl <-data.frame(
  Datasets = c("Miami Home Dataset", "Coastline Data", "Bars, Pubs, & Restaurant Data", "Sexual Offenders Data", "Miami Park Facilities Data","Miami-Dade Transit (MDT) Metromover and Metrorail stations", "2014-2018 ACS 5-year Estimates"),
  Source = c("Provided", "OpenStreetMap", "OpenStreetMap", "Miami-Dade County Open Data Hub", "Miami-Dade County Open Data Hub", "Miami-Dade County Open Data Hub", "U.S. Census Bureau")
)

kbl(text_tabl) %>%
  kable_paper(c("hover", "condensed"), full_width = F, html_font = "Montserrat") %>%
  kable_styling (bootstrap_options = "striped", "condensed", font_size = 10) %>%
  row_spec(0, bold = T, color = "white", background = "dodgerblue", font_size = 14)
Datasets Source
Miami Home Dataset Provided
Coastline Data OpenStreetMap
Bars, Pubs, & Restaurant Data OpenStreetMap
Sexual Offenders Data Miami-Dade County Open Data Hub
Miami Park Facilities Data Miami-Dade County Open Data Hub
Miami-Dade Transit (MDT) Metromover and Metrorail stations Miami-Dade County Open Data Hub
2014-2018 ACS 5-year Estimates U.S. Census Bureau

Loading Basemap

# Loading Miami base map from OSM Data
miami.base <- 
  st_read("https://opendata.arcgis.com/datasets/5ece0745e24b4617a49f2e098df8117f_0.geojson") %>%
  filter(NAME == "MIAMI BEACH" | NAME == "MIAMI") %>%
  st_union()

xmin = st_bbox(miami.base)[[1]]
ymin = st_bbox(miami.base)[[2]]
xmax = st_bbox(miami.base)[[3]]  
ymax = st_bbox(miami.base)[[4]]

Loading Home Data (Provided) and Neighborhood Data

MiamiProperties<-
  st_read("C:/Users/wagne/Documents/GitHub/ParkWagner_MidtermMUSA508/studentsData.geojson")%>%
  #st_read("/Users/davidseungleepark/Library/Mobile Documents/com~apple~CloudDocs/Fall 2020/cpln592/ParkWagner_MidtermMUSA508/studentsData.geojson") %>%
  
  mutate(pool = ifelse(str_detect((XF1), "Pool") | str_detect((XF2), "Pool") | str_detect((XF3), "Pool"), "Pool", "No Pool")) %>% 
  mutate(singlefamily = ifelse(str_detect(Zoning, "SINGLE FAMILY"), "Yes", "No")) %>%
  mutate(estates = ifelse(str_detect(Zoning, "ESTATES"), "Yes", "No")) %>%
  mutate(dock = ifelse(str_detect((XF1), "Dock") | str_detect((XF2), "Dock") | str_detect((XF3), "Dock"), "Dock", "No Dock")) %>%
  mutate(Age = (2020 - EffectiveYearBuilt),0) %>%
  mutate(luxury=ifelse(str_detect((XF1), "Luxury") | str_detect((XF2), "Luxury") | str_detect((XF3), "Luxury"), "Yes", "No")) %>%
  mutate(elevator=ifelse(str_detect((XF1), "Elevator") | str_detect((XF2), "Elevator") | str_detect((XF3), "Elevator"), "Yes", "No")) %>%
  mutate(Gazebo=ifelse(str_detect((XF1), "Gazebo") | str_detect((XF2), "Gazebo") | str_detect((XF3), "Gazebo"), "Yes", "No")) %>%
  mutate(BeachView=ifelse(str_detect((Legal1), "BEACH VIEW") | str_detect((Legal2), "BEACH VIEW") | str_detect((Legal3), "BEACH VIEW"), "Yes", "No")) %>%
  dplyr::select(-Land, -Year,-Bldg, -Total, -Assessed,  -saleDate, -saleType, -saleQual, -County.Senior, -County.LongTermSenior, -County.Other.Exempt, -Owner1, -Owner2, 
                -Mailing.Address, -Mailing.City, -Mailing.State, -Mailing.Zip, -Mailing.Country, -YearBuilt,
                -City.Senior, -City.LongTermSenior, -City.Other.Exempt, -Legal1, -Legal2, -Legal3, -Legal4, -Legal5, -Legal6, -XF1, -XF2, -XF3,
                -WVDB, -HEX, -GPAR, -County.2nd.HEX, -City.2nd.HEX, -MillCode, -Zoning, -Land.Use)

st_crs(MiamiProperties)

# Loading elementary school boundaries 
elementary.school.boundaries <- 
  st_read("https://opendata.arcgis.com/datasets/19f5d8dcd9714e6fbd9043ac7a50c6f6_0.geojson") %>%
  st_transform('ESRI:102658')

#Neighborhood Data
neighborhood<-
  st_read("https://opendata.arcgis.com/datasets/2f54a0cbd67046f2bd100fb735176e6c_0.geojson")%>%
  st_transform('ESRI:102658')%>%
  dplyr::select(LABEL)%>%
  rename(Neighborhood=LABEL)

muni_boundary<-
  st_read("https://opendata.arcgis.com/datasets/5ece0745e24b4617a49f2e098df8117f_0.geojson")%>%
  filter(NAME=="MIAMI BEACH")%>%
  st_transform('ESRI:102658')%>%
  dplyr::select(NAME)%>%
  rename(Neighborhood=NAME)

all_nhoods<-
  rbind(neighborhood,muni_boundary)

#Miami Beach Neighborhoods
MBAreas<-st_read("https://opendata.arcgis.com/datasets/a21e846f4e8e4d81ad3b75bc4f334516_0.geojson")%>%
  filter(FID>1)%>%
  filter(FID<130)%>%
  filter(LAND=="Y")%>%
  st_transform('ESRI:102658')%>%
  dplyr::select(FID)%>%
  rename(Neighborhood=FID)

miami.base_map<-
  miami.base%>%
  st_transform('ESRI:102658')

elem_map<-
  elementary.school.boundaries%>%
  st_crop(miami.base_map)%>%
  rename(elem_name=NAME)

all_nhoods_MB<-
  rbind(neighborhood,MBAreas)%>%
  st_crop(miami.base_map)

Calculating Distance to Coastline; Calculating proximity to Bars, Pubs, and Restaurants; Sexual Offenses; Parks; and Public Transportation

## Calculate the distance to Coastline (this calculation has to be in WGS84)
Coastline<-opq(bbox = c(xmin, ymin, xmax, ymax)) %>% 
  add_osm_feature("natural", "coastline") %>%
  osmdata_sf()

### add to MiamiProperties and convert to miles
MiamiProperties <-
  MiamiProperties %>%
  mutate(CoastDist=(geosphere::dist2Line(p=st_coordinates(st_centroid(MiamiProperties)),
                                        line=st_coordinates(Coastline$osm_lines)[,1:2])*0.00062137)[,1])

##Convert Miami Data to Local Projection #st_transform('ESRI:102658')
MiamiProperties <-
  MiamiProperties%>%
  st_transform('ESRI:102658')

MiamiProperties<-
  MiamiProperties%>%
  mutate(milecoast=ifelse(CoastDist<1,"Yes","No"))%>%
  mutate(halfmilecoast=ifelse(CoastDist<0.5,"Yes","No"))

## Add Data on Bars, pubs, and restaurants
bars <- opq(bbox = c(xmin, ymin, xmax, ymax)) %>% 
  add_osm_feature(key = 'amenity', value = c("bar", "pub", "restaurant")) %>%
  osmdata_sf()

bars<-
  bars$osm_points%>%
  .[miami.base,]

bars<-
  bars%>%
  st_transform('ESRI:102658')
  
MiamiProperties<-
  MiamiProperties %>%
  mutate(
    bars_nn1= nn_function(st_coordinates(st_centroid(MiamiProperties)),st_coordinates(bars),1),
    bars_nn2= nn_function(st_coordinates(st_centroid(MiamiProperties)),st_coordinates(bars),2),
    bars_nn3= nn_function(st_coordinates(st_centroid(MiamiProperties)),st_coordinates(bars),3),
    bars_nn4= nn_function(st_coordinates(st_centroid(MiamiProperties)),st_coordinates(bars),4),
    bars_nn5= nn_function(st_coordinates(st_centroid(MiamiProperties)),st_coordinates(bars),5))

MiamiProperties$bars_Buffer =
  st_buffer(MiamiProperties, 660) %>%
  aggregate(mutate(dplyr::select(bars), counter = 1),., sum) %>%
  pull(counter)

MiamiProperties<-
  MiamiProperties%>%
  mutate(bars_Buffer = replace_na(bars_Buffer, 0))

## Add data on crime- Sexual Offenders and Predators within Miami-Dade County point data 
miami.sexualoffenders <-  
  st_read("https://opendata.arcgis.com/datasets/f8759d722aeb4198bfe7c4ad780604d2_0.geojson") %>%
  filter(CITY == "MIAMI" | CITY == "MIAMI BEACH" | CITY == "Miami" | CITY == "Miami Beach") %>%
  st_transform('ESRI:102658')

MiamiProperties$sexualoffenders_Buffer =
  st_buffer(MiamiProperties, 660) %>%
  aggregate(mutate(dplyr::select(miami.sexualoffenders), counter = 1),., sum) %>%
  pull(counter)

MiamiProperties<-
  MiamiProperties%>%
  mutate(sexualoffenders_Buffer = replace_na(sexualoffenders_Buffer, 0))

MiamiProperties <-
  MiamiProperties %>% 
  mutate(
    crime_nn1= nn_function(st_coordinates(st_centroid(MiamiProperties)),st_coordinates(miami.sexualoffenders),1),
    crime_nn2= nn_function(st_coordinates(st_centroid(MiamiProperties)),st_coordinates(miami.sexualoffenders),2),
    crime_nn3= nn_function(st_coordinates(st_centroid(MiamiProperties)),st_coordinates(miami.sexualoffenders),3),
    crime_nn4= nn_function(st_coordinates(st_centroid(MiamiProperties)),st_coordinates(miami.sexualoffenders),4),
    crime_nn5= nn_function(st_coordinates(st_centroid(MiamiProperties)),st_coordinates(miami.sexualoffenders),5))

## Add the data on Park Facilities
Parks<-st_read("https://opendata.arcgis.com/datasets/8c9528d3e1824db3b14ed53188a46291_0.geojson")%>%
st_transform('ESRI:102658')

MiamiProperties<-
  MiamiProperties %>% 
  mutate(
    parks_nn1= nn_function(st_coordinates(st_centroid(MiamiProperties)),st_coordinates(Parks),1),
    parks_nn2= nn_function(st_coordinates(st_centroid(MiamiProperties)),st_coordinates(Parks),2),
    parks_nn3= nn_function(st_coordinates(st_centroid(MiamiProperties)),st_coordinates(Parks),3),
    parks_nn4= nn_function(st_coordinates(st_centroid(MiamiProperties)),st_coordinates(Parks),4),
    parks_nn5= nn_function(st_coordinates(st_centroid(MiamiProperties)),st_coordinates(Parks),5))

MiamiProperties$parks.Buffer =
  st_buffer(MiamiProperties, 660) %>%
  aggregate(mutate(dplyr::select(Parks), counter = 1),., sum) %>%
  pull(counter)

MiamiProperties<-
  MiamiProperties%>%
  mutate(parks.Buffer = replace_na(parks.Buffer, 0))

#TOD or non-TOD; distance to transit stop?
metrorail_stop<-st_read("https://opendata.arcgis.com/datasets/ee3e2c45427e4c85b751d8ad57dd7b16_0.geojson")%>%
  st_transform('ESRI:102658')%>%
  dplyr::select(NAME)

metromover_stop<-st_read("https://opendata.arcgis.com/datasets/aec76104165c4e879b9b0203fa436dab_0.geojson")%>%
  st_transform('ESRI:102658')%>%
  dplyr::select(NAME)

metro_stops<-
  rbind(metromover_stop,metrorail_stop)

#Distance to metro_stop (Added a column for the distance to the nearest stop and a column for homes that are within 0.5 miles of a stop)
MiamiProperties<-
  MiamiProperties %>% 
  mutate(
    metro_nn1= nn_function(st_coordinates(st_centroid(MiamiProperties)),st_coordinates(metro_stops),1),
    Halfmile_metro=ifelse(metro_nn1<2640,"Halfmile_metro","Non_Halfmile_metro"))

Adding Neighborhood and Elementary School Names to Each Home Update this to include the correct centroid function

#Add Elementary School Name to Each Property
elementary.school.clean<-
  elementary.school.boundaries%>%
  dplyr::select(NAME)%>%
  rename(elem_name=NAME)

MiamiProperties <- st_join(st_centroid(MiamiProperties), elementary.school.clean, left = TRUE)

#Add neighborhood name to each property
MiamiProperties <- st_join(st_centroid(MiamiProperties), all_nhoods, left = TRUE)

#Add variable for Miami or Miami Beach
MiamiProperties<-
  MiamiProperties%>%
  mutate(MiamiBeach=ifelse(Property.City=="Miami Beach","Yes","No"))

Loading Census Data

## Load in census data
census_api_key("41e1c0d912341017fa6f36a5da061d3b23de335e", overwrite = TRUE)

selected_vars <- c("B02001_001E", # Estimate!!Total population by race -- ##let's double check that it's okay to use this as long as we justify it
                   "B02001_002E", # People describing themselves as "white alone"
                   "B02001_003E", # People describing themselves as "black" or "african-american" alone
                   "B15001_050E", # Females with bachelors degrees
                   "B15001_009E", # Males with bachelors degrees
                   "B19013_001E", # Median HH income
                   "B25058_001E", # Median rent
                   "B06012_002E", # Total poverty
                   "B08301_001E", # People who have means of transportation to work
                   "B08301_002E", # Total people who commute by car, truck, or van
                   "B08301_010E", # Total people who commute by public transportation"
                   "B03002_012E", # Estimate Total Hispanic or Latino by race
                   "B19326_001E", # Median income in past 12 months (inflation-adjusted)
                   "B07013_001E", # Total households
                   "B07013_002E", # Total owner-occupied households
                   "B07013_003E", # total renter-occupied households
                   "B25027_001E",
                   "B25027_010E",
                   "B25038_002E",
                   "B25038_003E",
                   "B25038_004E",
                   "B25038_005E",
                   "B19001_017E")

tracts18 <- 
  get_acs(geography = "tract", 
          variables = selected_vars, 
          year=2018, 
          state=12,
          county = 086,
          geometry=T, 
          output="wide") %>%
  st_transform('ESRI:102658') %>%
  rename(TotalPop = B02001_001E, 
         Whites = B02001_002E,
         Blacks = B02001_003E,
         FemaleBachelors = B15001_050E, 
         MaleBachelors = B15001_009E,
         MedHHInc = B19013_001E, 
         MedRent = B25058_001E,
         TotalPoverty = B06012_002E,
         TotalCommute = B08301_001E,
         CarCommute = B08301_002E,
         PubCommute = B08301_010E,
         TotalHispanic = B03002_012E,
         MedInc = B19326_001E,
         TotalHH = B07013_001E,
         OwnerHH = B07013_002E,
         RenterHH = B07013_003E,
         #TotalHH2 = B25027_001E,
         HHNoMort = B25027_010E,
         Own2017later = B25038_003E,
         Own201516 = B25038_004E,
         Own201014 = B25038_005E,
         HH200k = B19001_017E) %>%
  dplyr::select(-NAME, -starts_with("B0"), -starts_with("B1"), -starts_with("B2")) %>%
  mutate(pctWhite = (ifelse(TotalPop > 0, Whites / TotalPop,0))*100,
         pctBlack = (ifelse(TotalPop > 0, Blacks / TotalPop,0))*100,
         pctHis = (ifelse(TotalPop >0, TotalHispanic/TotalPop,0))*100,
         pctBlackorHis = (ifelse (TotalPop>0, (Blacks+TotalHispanic)/TotalPop,0)) *100,
         pctBachelors = (ifelse(TotalPop > 0, ((FemaleBachelors + MaleBachelors) / TotalPop),0)) *100,
         pctPoverty = (ifelse(TotalPop > 0, TotalPoverty / TotalPop, 0))*100,
         pctCarCommute = (ifelse(TotalCommute > 0, CarCommute / TotalCommute,0))*100,
         pctPubCommute = (ifelse(TotalCommute > 0, PubCommute / TotalCommute,0))*100,
         pctOwnerHH = (ifelse(TotalHH > 0, OwnerHH / TotalHH,0))*100,
         pctRenterHH = (ifelse(TotalHH > 0, RenterHH / TotalHH,0))*100,
         pctNoMortgage = (ifelse(TotalHH > 0, HHNoMort / TotalHH,0))*100,
         pctOwnerSince2010 = (ifelse(OwnerHH > 0, ((Own2017later + Own201516 + Own201014) / OwnerHH),0)) *100,
         pctHH200kOrMore = (ifelse(TotalHH > 0, (HH200k/ TotalHH),0))*100,
         year = "2018") %>%
  mutate(MedHHInc = replace_na(MedHHInc, 0),
         MedRent = replace_na(MedRent,0),
         pctBachelors= replace_na(pctBachelors,0),
         pctHis= replace_na(pctHis,0),
         pctOwnerHH= replace_na(pctOwnerHH,0),
         pctCarCommute= replace_na(pctCarCommute,0),
         pctHH200kOrMore= replace_na(pctHH200kOrMore,0)) %>%
  dplyr::select(-Whites, -Blacks, -FemaleBachelors, -MaleBachelors, -TotalPoverty, -CarCommute, -PubCommute, -TotalCommute, -TotalHispanic)

#merge the data into the MiamiProperties dataframe
MiamiProperties <-st_join((st_centroid(MiamiProperties)),tracts18, left =TRUE) %>%
                    mutate(MedHHInc = replace_na(MedHHInc, 0),
                           MedRent = replace_na(MedRent,0),
                           pctBachelors= replace_na(pctBachelors,0),
                           pctHis= replace_na(pctHis,0),
                           pctOwnerHH= replace_na(pctOwnerHH,0),
                           pctCarCommute= replace_na(pctCarCommute,0),
                           pctHH200kOrMore= replace_na(pctHH200kOrMore,0))

Completing Feature Engineering and Filtering Out Homes Where Sales Prices Have Been Removed

## create dataframe of homes to predict
MiamiPropertiesPred<-
  MiamiProperties %>%
  filter(toPredict == 1)

## create dataframe of rest of the homes
MiamiProperties<-
  MiamiProperties%>%
  filter(toPredict == 0)
<<<<<<< HEAD =======

Variables

Present a table of summary statistics with variable descriptions. Sort these variables by their category (internal characteristics, amenities/public services or spatial structure). Check out the stargazer package for this.

>>>>>>> 30a48337056764284380a80a0436d5403662727e

Variables

This is a table of all the engineered features that we considered including in the model.

engin_vars_table <-data.frame(
  Variable = c("CoastDist", "milecoast", "bars_nn1 ~ 5", "bars_Buffer", "crime_nn1 ~ 5","sexualoffenders_Buffer", "park_nn1 ~ 5", "parks.Buffer" , "metro_nn1", "Halfmile_metro","pctWhite", "pctBlack", "pctHis", "pctBlackorHis", "pctBachelors","pctPoverty", "pctCarCommute", "pctPubCommute", "pctOwnerHH", "pctRenterHH", "pctNoMortgage", "pctOwnerSince2010", "pctHH200kOrMore", "pool", "singlefamily", "estates", "dock", "Age", "luxury", "elevator", "BeachView"),
  Dataset = c("OpenStreetMap", "OpenStreetMap", "OpenStreetMap", "OpenStreetMap", "Miami-Dade County Open Data Hub", "Miami-Dade County Open Data Hub", "Miami-Dade County Open Data Hub", "Miami-Dade County Open Data Hub", "Miami-Dade County Open Data Hub", "Miami-Dade County Open Data Hub", "Miami-Dade County Open Data Hub", "U.S. Census Bureau", "U.S. Census Bureau", "U.S. Census Bureau", "U.S. Census Bureau", "U.S. Census Bureau", "U.S. Census Bureau", "U.S. Census Bureau", "U.S. Census Bureau", "U.S. Census Bureau", "U.S. Census Bureau", "U.S. Census Bureau", "U.S. Census Bureau", "Provide Miami Housing Data", "Provide Miami Housing Data", "Provide Miami Housing Data", "Provide Miami Housing Data", "Provide Miami Housing Data", "Provide Miami Housing Data", "Provide Miami Housing Data", "Provide Miami Housing Data"),
  Description = c("Distance to coast", "Units within a mile from the coast", "Average distances of 1-5 bars", "Number of bars within half mile from a housing units", "Average distances of 1-5 registered sexual offenders", "Number of sexual offenders within half mile from a housing unit", "Average distances of 1-5 parks", "Number of parks within half mile from the unit", "Distance to closest metrorail station", "Units within half mile from a metrorail stations", "Percentage of White population", "Percentage of Black population", "Percentage of Hispanic populaiton", "Percentage of Black and Hispanic population", "Percentage of population with at least Bachelor's degree", "Percentage of population below the poverty line", "Percentage of population communiting by car", "Percentage of population commuting by public transit", "Percentage of households in owner-occupied units", "Percentage of households in renter-occupied units", "Percentage of households with no mortgage", "Percentage of households who owned the house since 2010 or later", "Percentage of households with income of $200K or more", "Units with a pool", "Single Family zoning units", "Estates Zoning Units", "Units with a dock", "Age of the units", "Units with luxuary amenitiss", "Units with an elevator", "Units with a beach view"),
  "Data type" = c("continuous", "categorical", "continuous", "continuous", "continuous", "continuous", "continuous", "continuous", "continuous", "categorical", "continuous", "continuous", "continuous", "continuous", "continuous", "continuous", "continuous","continuous", "continuous", "continuous", "continuous", "continuous", "continous", "categorical", "categorical", "categorical", "categorical", "continous", "categorical", "categorical", "categorical"),
  "Final model" = c("No", "No", "No", "No", "No", "Yes", "Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No")
)

kbl(engin_vars_table) %>%
  kable_paper(c("striped", "hover"), html_font = "Montserrat", full_width = F) %>%
  kable_material() %>%
  kable_styling (bootstrap_options = "striped", "condensed", font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "dodgerblue", font_size = 13)%>%
  column_spec(1, bold = T) %>%
  column_spec(2, width = "25%") %>%
  column_spec(3, width = "40%")
Variable Dataset Description Data.type Final.model
CoastDist OpenStreetMap Distance to coast continuous No
milecoast OpenStreetMap Units within a mile from the coast categorical No
bars_nn1 ~ 5 OpenStreetMap Average distances of 1-5 bars continuous No
bars_Buffer OpenStreetMap Number of bars within half mile from a housing units continuous No
crime_nn1 ~ 5 Miami-Dade County Open Data Hub Average distances of 1-5 registered sexual offenders continuous No
sexualoffenders_Buffer Miami-Dade County Open Data Hub Number of sexual offenders within half mile from a housing unit continuous Yes
park_nn1 ~ 5 Miami-Dade County Open Data Hub Average distances of 1-5 parks continuous Yes
parks.Buffer Miami-Dade County Open Data Hub Number of parks within half mile from the unit continuous No
metro_nn1 Miami-Dade County Open Data Hub Distance to closest metrorail station continuous No
Halfmile_metro Miami-Dade County Open Data Hub Units within half mile from a metrorail stations categorical Yes
pctWhite Miami-Dade County Open Data Hub Percentage of White population continuous No
pctBlack U.S. Census Bureau Percentage of Black population continuous No
pctHis U.S. Census Bureau Percentage of Hispanic populaiton continuous Yes
pctBlackorHis U.S. Census Bureau Percentage of Black and Hispanic population continuous No
pctBachelors U.S. Census Bureau Percentage of population with at least Bachelor’s degree continuous Yes
pctPoverty U.S. Census Bureau Percentage of population below the poverty line continuous No
pctCarCommute U.S. Census Bureau Percentage of population communiting by car continuous Yes
pctPubCommute U.S. Census Bureau Percentage of population commuting by public transit continuous No
pctOwnerHH U.S. Census Bureau Percentage of households in owner-occupied units continuous Yes
pctRenterHH U.S. Census Bureau Percentage of households in renter-occupied units continuous No
pctNoMortgage U.S. Census Bureau Percentage of households with no mortgage continuous No
pctOwnerSince2010 U.S. Census Bureau Percentage of households who owned the house since 2010 or later continuous No
pctHH200kOrMore U.S. Census Bureau Percentage of households with income of $200K or more continous Yes
pool Provide Miami Housing Data Units with a pool categorical Yes
singlefamily Provide Miami Housing Data Single Family zoning units categorical Yes
estates Provide Miami Housing Data Estates Zoning Units categorical Yes
dock Provide Miami Housing Data Units with a dock categorical Yes
Age Provide Miami Housing Data Age of the units continous Yes
luxury Provide Miami Housing Data Units with luxuary amenitiss categorical Yes
elevator Provide Miami Housing Data Units with an elevator categorical Yes
BeachView Provide Miami Housing Data Units with a beach view categorical No

Correlation:

As part of the exploratory analysis, we examined correlation to assist in identifying features that may be useful for predicting sales price. A correlation matrix helps us visualize correlation across numeric variables. In the figures, the darker colors imply stronger correlation. These plot helps to determine features that are correlated to sale prices (see sale prices row) and variables that are correlated with each other. The first figure depicts a correlation plot of variables derived from the U.S. Census Bureau. The second figure depicts the correlation between the numeric features included in the final model. We developed several correlation matrices with additional variables in order to help determine features to include in the final model.

#Correlation of Census Variables
<<<<<<< HEAD

censusVars <- 
  select(st_drop_geometry(MiamiProperties), SalePrice, MedHHInc, MedRent, pctWhite, pctBlack, pctHis, pctBlackorHis, 
          pctBachelors, pctPoverty, pctCarCommute, pctPubCommute, pctOwnerHH, pctRenterHH, pctNoMortgage, pctOwnerSince2010, pctHH200kOrMore) %>% na.omit()
ggcorrplot(
  round(cor(censusVars), 1), 
  p.mat = cor_pmat(censusVars),
  colors = c("#25CB10", "white", "#FA7800"),
  type="lower",
  insig = "blank") +  
  labs(title = "Correlation Across Census Variables",
       caption="Correlation Plot of Census Variables") 

======= censusVars <- select(st_drop_geometry(MiamiProperties), SalePrice, MedHHInc, MedRent, pctWhite, pctBlack, pctHis, pctBlackorHis, pctBachelors, pctPoverty, pctCarCommute, pctPubCommute, pctOwnerHH, pctRenterHH) %>% na.omit() ggcorrplot( round(cor(censusVars), 1), p.mat = cor_pmat(censusVars), colors = c("#25CB10", "white", "#FA7800"), type="lower", insig = "blank") + labs(title = "Correlation Across Census Variables", caption="Figure 1. Correlation Plot of Census Variables")

>>>>>>> 30a48337056764284380a80a0436d5403662727e
#Correlation of Variables Included in Model
reg_final_vars <- 
  select(st_drop_geometry(MiamiProperties), SalePrice, LotSize, Bed, Bath, Age, ActualSqFt, parks_nn1, MedHHInc, pctHis, pctBachelors, pctOwnerHH, pctCarCommute, pctHH200kOrMore, sexualoffenders_Buffer) %>% na.omit()

ggcorrplot(
  round(cor(reg_final_vars), 1), 
  p.mat = cor_pmat(reg_final_vars),
  colors = c("#25CB10", "white", "#FA7800"),
  type="lower",
  insig = "blank") +  
  labs(title = "Correlation Across Final Numeric Features",
<<<<<<< HEAD
       caption="Correlation Plot of All Numeric Features Included in Final Model") 

The charts below plot sales price as a function of numeric features. The first figure is an example of a diagnostic tool used to select features for inclusion in the final model. The series of plots show the relationship between sale price and 6 crime variables for sexual offender and predators within the region (5 nearest neighbor variables and one buffer variable that counts the number of offenses within 1/8 mile of each home). By reviewing the slopes of the plotted lines, we selected the crime variable that was most highly correlated with Sale Prices to include in the final model.

The second figure depicts four correlation scatterplots between continuous features and sale prices that suggest correlation. In all cases, the regression line slopes upward from left to right, meaning that on average, as the variable of interest (e.g., home size, median household income, distance to the nearest park, and household income greater than $200,000) increases, so does price. Correlation can also be described by the slope of the line. The greater the slope, the greater the feature’s effect on price.

======= caption="Figure 2. Correlation Plot of All Numeric Features Included in Final Model")

The charts below plot sales price as a function of numeric features. Figure 3 is an example of a diagnostic tool used to select features for inclusion in the final model. The series of plots show the relationship between sale price and 6 crime variables for sexual offender and predators within the region (5 nearest neighbor variables and one buffer variable that counts the number of offenses within 1/8 mile of each home). By reviewing the slopes of the plotted lines, we selected the crime variable that was most highly correlated with Sale Prices to include in the final model.

Figure 4 depicts four correlation scatterplots between continuous features and sale prices that suggest correlation. In all cases, the regression line slopes upward from left to right, meaning that on average, as the variable of interest (e.g., home size, median household income, distance to the nearest park, and household income greater than $200,000) increases, so does price. Correlation can also be described by the slope of the line. The greater the slope, the greater the feature’s effect on price.

>>>>>>> 30a48337056764284380a80a0436d5403662727e
#Correlation Scatterplots for all NN variable for Crime, as an example:
st_drop_geometry(MiamiProperties) %>% 
  dplyr::select(SalePrice, crime_nn1, crime_nn2, crime_nn3, crime_nn4, crime_nn5, sexualoffenders_Buffer) %>%
  gather(Variable, Value, -SalePrice) %>% 
  ggplot(aes(Value, SalePrice)) +
  geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#FA7800") +
  facet_wrap(~Variable, ncol = 3, scales = "free") +
  labs(title = "Correlation between Sale Price and Crime Features",
       caption="Correlation Scatterplot Depicting the Relationship Between Sale Price and Crime Features") +
  plotTheme()+
  theme(strip.text.x = element_text(size = 10))
<<<<<<< HEAD

=======

>>>>>>> 30a48337056764284380a80a0436d5403662727e
#Correlation Scatterplots for Variables of Interest
st_drop_geometry(MiamiProperties) %>% 
  mutate(Age = 2020 - EffectiveYearBuilt) %>%
  dplyr::select(SalePrice, LivingSqFt, parks_nn1, MedHHInc, pctHH200kOrMore) %>%
  #filter(SalePrice <= 1000000, Age < 500) %>%
  gather(Variable, Value, -SalePrice) %>% 
  ggplot(aes(Value, SalePrice)) +
  geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#FA7800") +
  facet_wrap(~Variable, ncol = 3, scales = "free") +
  labs(title = "Price as a Function of Continuous Variables",
       caption="Correlation between Selected Continuous Variables and Sale Price") +
  plotTheme()+
  theme(strip.text.x = element_text(size = 10))
<<<<<<< HEAD

=======

>>>>>>> 30a48337056764284380a80a0436d5403662727e

Some of the features included in the analysis are categorical rather than numeric (see Figure 5). Instead of using slope to calculate correlation, we evaluate the difference in mean price across each category. Differences in mean sales prices across categories suggest that the variable may be a good predictor of Sale Prices.

st_drop_geometry(MiamiProperties) %>%
  dplyr::select(SalePrice, pool, elevator, luxury)%>%
  gather(Variable,Value, -SalePrice)%>%
  ggplot(aes(Value, SalePrice))+
  geom_bar(position="dodge",stat="summary", fun.y="mean")+
  facet_wrap(~Variable, ncol=3, scales="free")+
  labs(title = "Price as a Function of Categorical Variables",
       caption="Correlation between Selected Categorical Variables and Average Sale Price") +
  plotTheme()
<<<<<<< HEAD

The figure below depicts the spatial distribution of sale prices (or price per square foot) across Miami and Miami Beach. The map suggests that prices are clustered within the city, rather than randomly distributed. In other words, homes with a greater price per square foot (orange color) are grouped together (e.g., on the eastern coast of the mainland and within Miami Beach) and homes with a lower price per square foot (green color) are grouped together father inland.

=======

Figure 6 depicts the spatial distribution of sale prices (or price per square foot) across Miami and Miami Beach. The map suggests that prices are clustered within the city, rather than randomly distributed. In other words, homes with a greater price per square foot (orange color) are grouped together (e.g., on the eastern coast of the mainland and within Miami Beach) and homes with a lower price per square foot (green color) are grouped together father inland.

>>>>>>> 30a48337056764284380a80a0436d5403662727e
Miami.Plot<-
  MiamiProperties%>%
  mutate(PricePerSq=SalePrice/ActualSqFt)

ggplot()+
  geom_sf(data=all_nhoods_MB, fill="grey40")+
  geom_sf(data=Miami.Plot, aes(colour=q5(PricePerSq)),
          show.legend="point", size=.75)+
  scale_colour_manual(values=palette5,
                      labels=qBr(Miami.Plot, "PricePerSq"),
                      name="Quintile\nBreaks")+
  labs(title="Price Per Square Foot, Miami",
       caption="Price per Square Foot for Homes in Miami and Miami Beach") +
  mapTheme()
<<<<<<< HEAD

The maps below depict independent variables included in the final model. The first map shows the density of sexual offenders and predators in Miami and Miami Beach. The locations with the greatest density of sexual offenders and predators in the northern portions of Miami. The second map shows the locations of single-family and multi-family homes. Many single family homes are clustered along the shorline, with multi-family homes largely located further inland. The third maps shows the locations of County, Municipal, State, and Federal Park Facilities in the area. Parks are fairly evenly distributed throughout the area, with a slightly lesser density of parks further inland.

sexual_offend_map<-
  miami.sexualoffenders%>%
  st_intersection(all_nhoods)

Parks_map<-
  Parks%>%
  st_intersection(all_nhoods_MB)

grid.arrange(ncol = 2,
             ggplot() + geom_sf(data = all_nhoods_MB, fill = "grey40") +
  stat_density2d(data = data.frame(st_coordinates(sexual_offend_map)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = "#25CB10", high = "#FA7800", name = "Density") +
  scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
  labs(title = "Locations of Sexual Offenders and Predators",
       caption="Density of Sexual Offenders in Miami and Miami Beach") +
  mapTheme(),
  ggplot()+
  geom_sf(data=all_nhoods_MB, fill="grey40")+
  geom_sf(data=Miami.Plot, aes(colour=singlefamily),
          show.legend="point", size=.75)+
  labs(title="Single vs. Multi Family Homes",
       caption="Figure 9. Single and Multi Family Homes in Miami and Miami Beach") +
  mapTheme())

ggplot()+
    geom_sf(data=all_nhoods_MB, fill="grey40")+
    geom_sf(data=Parks_map, color="lightgreen", show.legend="point", size=0.75)+
    labs(title="Park Locations",
       catption="Locations of Park Facilities in Miami and Miami Beach") +
    mapTheme()

=======

The maps below depict independent variables included in the final model. Figure 7 depicts the density of sexual offenders and predators in Miami and Miami Beach. The locations with the greatest density of sexual offenders and predators in the northern portions of Miami. Figure 8 shows the locations of County, Municipal, State, and Federal Park Facilities in the area. Parks are fairly evenly distributed throughout the area, with a slightly lesser density of parks further inland. Figure 9 shows the locations of single-family and multi-family homes. Many single family homes are clustered along the shorline, with multi-family homes largely located further inland.

#density of sexual offenders
sexual_offend_map<-
  miami.sexualoffenders%>%
  st_intersection(all_nhoods)

ggplot() + geom_sf(data = all_nhoods_MB, fill = "grey40") +
  stat_density2d(data = data.frame(st_coordinates(sexual_offend_map)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = "#25CB10", high = "#FA7800", name = "Density") +
  scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
  labs(title = "Locations of Sexual Offenders and Predators",
       caption="Figure 7. Density of Sexual Offenders in Miami and Miami Beach") +
  mapTheme()

#Parks
Parks_map<-
  Parks%>%
  st_intersection(all_nhoods_MB)
ggplot()+
  geom_sf(data=all_nhoods_MB, fill="grey40")+
  geom_sf(data=Parks_map, size=1, color="lightgreen")+
  labs(title="Park Locations",
       catption="Figure 8. Locations of Park Facilities in Miami and Miami Beach") +
  mapTheme()

#Single Family Homes vs. Multi Family Homes
ggplot()+
  geom_sf(data=all_nhoods_MB, fill="grey40")+
  geom_sf(data=Miami.Plot, aes(colour=singlefamily),
          show.legend="point", size=.75)+
  labs(title="Single vs. Multi Family Homes",
       caption="Figure 9. Single and Multi Family Homes in Miami and Miami Beach") +
  mapTheme()

>>>>>>> 30a48337056764284380a80a0436d5403662727e

3. Methods

Model Building

After exploring and testing correlations of all the variables, we have started our model building process by looking at only at physical attributes of the housing units (reg1). Then we added additional engineered variables to see how the model performs (reg2). In reg3, we included almost all variables, even the ones we have determined to have strong collinearity, just to see how the model with too many variables that are collinear would perform.

<<<<<<< HEAD

After our initial exploratory models, we used the stepwise backward method to find the statistically best model. The stepwise backward function drops each individual variable at a time to find a model with the least AIC (Akaike information criterion) which estimates the quality of each model relative to other models. the output of this process, reg4 was the model with the smallest AIC value, indicating this output is the statistically best model.

The variables with crime factors were not part of the output from the stepwise backward function. However, we made a judgment call that crime (proximity or density of sexual offenders as a proxy) is important factor in the predicting the price of a home. So we decided to make a model (reg5) that includes “crime_nn5”, which indicates the average distances of the 5 nearest sexual offenders, and another model (reg6) that includes “sexualoffenders_Buffer”, which indicates the number of sexual offenders in a half mile buffer of a home.

=======
## model with physical attributes of the housing
reg1 <- lm(SalePrice ~ ., data = st_drop_geometry(MiamiProperties) %>% 
                  dplyr::select(SalePrice, LotSize, Bed, Bath, Age, ActualSqFt, elevator, dock, BeachView, pool, luxury, singlefamily, estates))
summary(reg1) %>%
xtable() %>%
kable()
Estimate Std. Error t value Pr(>|t|)
(Intercept) 504740.78752 112369.359446 4.4918009 0.0000074
LotSize 48.00983 5.175078 9.2771209 0.0000000
Bed -177431.45740 26364.889763 -6.7298388 0.0000000
Bath 130667.03461 27776.336931 4.7042573 0.0000027
Age -1692.83970 663.550540 -2.5511843 0.0107923
ActualSqFt 601.23373 21.615105 27.8154439 0.0000000
elevatorYes 1459061.48261 177745.166004 8.2087267 0.0000000
dockNo Dock -938449.77490 90502.206638 -10.3693579 0.0000000
BeachViewYes -87089.15501 151595.424267 -0.5744841 0.5656897
poolPool 1660.53867 47763.736934 0.0347657 0.9722692
luxuryYes 3943743.59377 176617.755093 22.3292590 0.0000000
singlefamilyYes -243110.73077 34594.601127 -7.0274182 0.0000000
estatesYes 3432949.24232 203207.417448 16.8938186 0.0000000
## add addtional engineered variables
reg2 <- lm(SalePrice ~ ., data = st_drop_geometry(MiamiProperties) %>% 
                  dplyr::select(SalePrice, LotSize, Bed, Bath, Age, ActualSqFt, 
                                parks_nn1, MedRent, MedHHInc, pctBlackorHis, pctBachelors, pool, 
                                pctOwnerHH, crime_nn5, milecoast, pctNoMortgage, pctOwnerSince2010, pctCarCommute, pctPubCommute, 
                                pctPoverty, Halfmile_metro, CoastDist, pctHH200kOrMore, elevator, dock, luxury, BeachView))
summary(reg2)
## 
## Call:
## lm(formula = SalePrice ~ ., data = st_drop_geometry(MiamiProperties) %>% 
##     dplyr::select(SalePrice, LotSize, Bed, Bath, Age, ActualSqFt, 
##         parks_nn1, MedRent, MedHHInc, pctBlackorHis, pctBachelors, 
##         pool, pctOwnerHH, crime_nn5, milecoast, pctNoMortgage, 
##         pctOwnerSince2010, pctCarCommute, pctPubCommute, pctPoverty, 
##         Halfmile_metro, CoastDist, pctHH200kOrMore, elevator, 
##         dock, luxury, BeachView))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -6150950  -243991    23992   269054 11668506 
## 
## Coefficients:
##                                      Estimate   Std. Error t value
## (Intercept)                       1732717.954   404062.943   4.288
## LotSize                                82.986        5.127  16.187
## Bed                               -200237.241    27312.324  -7.331
## Bath                               121706.869    28997.633   4.197
## Age                                 -1544.741      685.881  -2.252
## ActualSqFt                            642.300       22.281  28.827
## parks_nn1                              51.400       21.364   2.406
## MedRent                               493.947       52.471   9.414
## MedHHInc                               -7.827        1.451  -5.394
## pctBlackorHis                        1473.638     3050.742   0.483
## pctBachelors                      -137797.338    34187.925  -4.031
## poolPool                          -279862.138    52041.332  -5.378
## pctOwnerHH                           4370.392     2420.360   1.806
## crime_nn5                             -31.101       14.112  -2.204
## milecoastYes                       -45641.078    64424.121  -0.708
## pctNoMortgage                       -2867.908     9708.868  -0.295
## pctOwnerSince2010                   -5777.683     4458.552  -1.296
## pctCarCommute                      -23557.440     3998.039  -5.892
## pctPubCommute                      -11823.922     6115.165  -1.934
## pctPoverty                           6090.940     2810.809   2.167
## Halfmile_metroNon_Halfmile_metro   -42914.658    55350.634  -0.775
## CoastDist                            8803.453    25191.572   0.349
## pctHH200kOrMore                     54814.177    13328.198   4.113
## elevatorYes                       1976762.846   184267.710  10.728
## dockNo Dock                      -1024027.443    94386.813 -10.849
## luxuryYes                         3914685.887   181844.548  21.528
## BeachViewYes                       196961.711   172970.845   1.139
##                                              Pr(>|t|)    
## (Intercept)                         0.000018666680324 ***
## LotSize                          < 0.0000000000000002 ***
## Bed                                 0.000000000000303 ***
## Bath                                0.000027939447231 ***
## Age                                            0.0244 *  
## ActualSqFt                       < 0.0000000000000002 ***
## parks_nn1                                      0.0162 *  
## MedRent                          < 0.0000000000000002 ***
## MedHHInc                            0.000000075011910 ***
## pctBlackorHis                                  0.6291    
## pctBachelors                        0.000057237164173 ***
## poolPool                            0.000000082166985 ***
## pctOwnerHH                                     0.0711 .  
## crime_nn5                                      0.0276 *  
## milecoastYes                                   0.4787    
## pctNoMortgage                                  0.7677    
## pctOwnerSince2010                              0.1951    
## pctCarCommute                       0.000000004301860 ***
## pctPubCommute                                  0.0533 .  
## pctPoverty                                     0.0303 *  
## Halfmile_metroNon_Halfmile_metro               0.4382    
## CoastDist                                      0.7268    
## pctHH200kOrMore                     0.000040329335504 ***
## elevatorYes                      < 0.0000000000000002 ***
## dockNo Dock                      < 0.0000000000000002 ***
## luxuryYes                        < 0.0000000000000002 ***
## BeachViewYes                                   0.2549    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 829000 on 2599 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8369, Adjusted R-squared:  0.8352 
## F-statistic: 512.8 on 26 and 2599 DF,  p-value: < 0.00000000000000022
## include almost all variables
reg3 <- lm(SalePrice ~ ., data = st_drop_geometry(MiamiProperties) %>% 
                  dplyr::select(SalePrice, LotSize, Bed, Bath, Age, 
                                ActualSqFt, parks_nn1, MedRent, MedHHInc, pctWhite, pctBlack, pctHis, pctBlackorHis, pctBachelors, pool, 
                                singlefamily, pctOwnerHH, pctRenterHH, crime_nn5, milecoast, pctNoMortgage, pctOwnerSince2010, pctCarCommute, 
                                pctPubCommute, pctPoverty, Halfmile_metro, metro_nn1, sexualoffenders_Buffer, milecoast, CoastDist, pctHH200kOrMore))
summary(reg3)
## 
## Call:
## lm(formula = SalePrice ~ ., data = st_drop_geometry(MiamiProperties) %>% 
##     dplyr::select(SalePrice, LotSize, Bed, Bath, Age, ActualSqFt, 
##         parks_nn1, MedRent, MedHHInc, pctWhite, pctBlack, pctHis, 
##         pctBlackorHis, pctBachelors, pool, singlefamily, pctOwnerHH, 
##         pctRenterHH, crime_nn5, milecoast, pctNoMortgage, pctOwnerSince2010, 
##         pctCarCommute, pctPubCommute, pctPoverty, Halfmile_metro, 
##         metro_nn1, sexualoffenders_Buffer, milecoast, CoastDist, 
##         pctHH200kOrMore))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -4169482  -349561    26271   363778 13967127 
## 
## Coefficients: (2 not defined because of singularities)
##                                      Estimate   Std. Error t value
## (Intercept)                      -1199084.087   598149.608  -2.005
## LotSize                                69.292        5.599  12.377
## Bed                               -311230.112    29612.849 -10.510
## Bath                               149003.993    31812.026   4.684
## Age                                 -3078.212      757.994  -4.061
## ActualSqFt                            907.582       20.603  44.051
## parks_nn1                              64.165       24.291   2.641
## MedRent                               521.641       57.422   9.084
## MedHHInc                               -9.410        1.528  -6.158
## pctWhite                            12467.684     4065.926   3.066
## pctBlack                            22126.357     6677.178   3.314
## pctHis                               8266.271     3931.581   2.103
## pctBlackorHis                              NA           NA      NA
## pctBachelors                      -139740.115    37923.627  -3.685
## poolPool                          -305848.900    57455.036  -5.323
## singlefamilyYes                   -548427.757    47211.072 -11.617
## pctOwnerHH                           9791.028     2839.093   3.449
## pctRenterHH                                NA           NA      NA
## crime_nn5                               3.317       16.844   0.197
## milecoastYes                       -61817.681    73425.550  -0.842
## pctNoMortgage                        2464.264    10851.585   0.227
## pctOwnerSince2010                    4609.115     5023.466   0.918
## pctCarCommute                      -23196.011     4590.258  -5.053
## pctPubCommute                       -7224.647     7335.817  -0.985
## pctPoverty                            157.046     3425.889   0.046
## Halfmile_metroNon_Halfmile_metro   -78563.665    66698.296  -1.178
## metro_nn1                              11.786        4.101   2.874
## sexualoffenders_Buffer             -20107.978    25869.854  -0.777
## CoastDist                          -28645.246    32654.659  -0.877
## pctHH200kOrMore                     59847.631    15142.705   3.952
##                                              Pr(>|t|)    
## (Intercept)                                  0.045103 *  
## LotSize                          < 0.0000000000000002 ***
## Bed                              < 0.0000000000000002 ***
## Bath                                   0.000002959711 ***
## Age                                    0.000050309473 ***
## ActualSqFt                       < 0.0000000000000002 ***
## parks_nn1                                    0.008304 ** 
## MedRent                          < 0.0000000000000002 ***
## MedHHInc                               0.000000000851 ***
## pctWhite                                     0.002189 ** 
## pctBlack                                     0.000933 ***
## pctHis                                       0.035603 *  
## pctBlackorHis                                      NA    
## pctBachelors                                 0.000234 ***
## poolPool                               0.000000110632 ***
## singlefamilyYes                  < 0.0000000000000002 ***
## pctOwnerHH                                   0.000572 ***
## pctRenterHH                                        NA    
## crime_nn5                                    0.843888    
## milecoastYes                                 0.399916    
## pctNoMortgage                                0.820373    
## pctOwnerSince2010                            0.358957    
## pctCarCommute                          0.000000464374 ***
## pctPubCommute                                0.324792    
## pctPoverty                                   0.963440    
## Halfmile_metroNon_Halfmile_metro             0.238946    
## metro_nn1                                    0.004085 ** 
## sexualoffenders_Buffer                       0.437068    
## CoastDist                                    0.380449    
## pctHH200kOrMore                        0.000079487259 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 910100 on 2598 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8035, Adjusted R-squared:  0.8014 
## F-statistic: 393.4 on 27 and 2598 DF,  p-value: < 0.00000000000000022

After our initial exploratory models, we used the stepwise backward method to find the statistically best model. The stepwise backward function drops each individual variable at a time to find a model with the least AIC (Akaike information criterion) which estimates the quality of each model relative to other models. the output of this process, reg4 was the model with the smallest AIC value, indicating this output is the statistically best model.

## Start:  AIC=71250.08
## SalePrice ~ LotSize + Bed + Bath + Age + ActualSqFt + parks_nn1 + 
##     MedRent + MedHHInc + pctWhite + pctBlack + pctHis + pctBlackorHis + 
##     pctBachelors + pool + singlefamily + pctOwnerHH + pctRenterHH + 
##     crime_nn5 + milecoast + pctNoMortgage + pctOwnerSince2010 + 
##     pctCarCommute + pctPubCommute + pctPoverty + Halfmile_metro + 
##     metro_nn1 + sexualoffenders_Buffer + CoastDist + pctHH200kOrMore + 
##     luxury + elevator + dock + estates
## 
## 
## Step:  AIC=71250.08
## SalePrice ~ LotSize + Bed + Bath + Age + ActualSqFt + parks_nn1 + 
##     MedRent + MedHHInc + pctWhite + pctBlack + pctHis + pctBlackorHis + 
##     pctBachelors + pool + singlefamily + pctOwnerHH + crime_nn5 + 
##     milecoast + pctNoMortgage + pctOwnerSince2010 + pctCarCommute + 
##     pctPubCommute + pctPoverty + Halfmile_metro + metro_nn1 + 
##     sexualoffenders_Buffer + CoastDist + pctHH200kOrMore + luxury + 
##     elevator + dock + estates
## 
## 
## Step:  AIC=71250.08
## SalePrice ~ LotSize + Bed + Bath + Age + ActualSqFt + parks_nn1 + 
##     MedRent + MedHHInc + pctWhite + pctBlack + pctHis + pctBachelors + 
##     pool + singlefamily + pctOwnerHH + crime_nn5 + milecoast + 
##     pctNoMortgage + pctOwnerSince2010 + pctCarCommute + pctPubCommute + 
##     pctPoverty + Halfmile_metro + metro_nn1 + sexualoffenders_Buffer + 
##     CoastDist + pctHH200kOrMore + luxury + elevator + dock + 
##     estates
## 
##                          Df       Sum of Sq              RSS   AIC
## - milecoast               1     34976861590 1556818908245101 71248
## - Halfmile_metro          1    178773872824 1556962705256336 71248
## - pctOwnerSince2010       1    376855090991 1557160786474503 71249
## - pctNoMortgage           1    474802046491 1557258733430002 71249
## - CoastDist               1    762916633576 1557546848017088 71249
## - pctPoverty              1    877484362284 1557661415745796 71250
## <none>                                      1556783931383512 71250
## - sexualoffenders_Buffer  1   1407645809325 1558191577192837 71250
## - crime_nn5               1   1410989790866 1558194921174378 71250
## - pctPubCommute           1   1809413695690 1558593345079202 71251
## - metro_nn1               1   3312886742800 1560096818126312 71254
## - pool                    1   3958649099013 1560742580482525 71255
## - parks_nn1               1   4279260338173 1561063191721685 71255
## - Age                     1   4347781940920 1561131713324432 71255
## - pctOwnerHH              1   4428324761288 1561212256144800 71256
## - pctBachelors            1   4535874751690 1561319806135202 71256
## - pctHis                  1   5260656535142 1562044587918653 71257
## - pctWhite                1   5674102784626 1562458034168137 71258
## - Bath                    1   8711401630863 1565495333014374 71263
## - pctBlack                1   8727987892154 1565511919275666 71263
## - pctCarCommute           1  15887481361670 1572671412745182 71275
## - MedHHInc                1  19151119776026 1575935051159538 71280
## - pctHH200kOrMore         1  20512870439495 1577296801823007 71282
## - Bed                     1  22434693350667 1579218624734178 71286
## - singlefamily            1  31992502942841 1588776434326353 71302
## - MedRent                 1  52232829388834 1609016760772346 71335
## - elevator                1  54925294938212 1611709226321724 71339
## - LotSize                 1  64988335959589 1621772267343100 71355
## - dock                    1  68821456703640 1625605388087152 71362
## - estates                 1 160065650805404 1716849582188916 71505
## - luxury                  1 312202401329569 1868986332713081 71728
## - ActualSqFt              1 421556392811348 1978340324194859 71877
## 
## Step:  AIC=71248.14
## SalePrice ~ LotSize + Bed + Bath + Age + ActualSqFt + parks_nn1 + 
##     MedRent + MedHHInc + pctWhite + pctBlack + pctHis + pctBachelors + 
##     pool + singlefamily + pctOwnerHH + crime_nn5 + pctNoMortgage + 
##     pctOwnerSince2010 + pctCarCommute + pctPubCommute + pctPoverty + 
##     Halfmile_metro + metro_nn1 + sexualoffenders_Buffer + CoastDist + 
##     pctHH200kOrMore + luxury + elevator + dock + estates
## 
##                          Df       Sum of Sq              RSS   AIC
## - Halfmile_metro          1    171438575196 1556990346820297 71246
## - pctOwnerSince2010       1    352904557827 1557171812802928 71247
## - pctNoMortgage           1    502137747760 1557321045992861 71247
## - pctPoverty              1    896137654646 1557715045899748 71248
## - CoastDist               1   1047793082086 1557866701327187 71248
## <none>                                      1556818908245101 71248
## - sexualoffenders_Buffer  1   1391887395928 1558210795641029 71248
## - crime_nn5               1   1461460944292 1558280369189394 71249
## - pctPubCommute           1   1845581829534 1558664490074635 71249
## - metro_nn1               1   3299923530050 1560118831775151 71252
## - pool                    1   3946651824479 1560765560069580 71253
## - Age                     1   4364251192178 1561183159437280 71253
## - parks_nn1               1   4367011227828 1561185919472930 71253
## - pctOwnerHH              1   4442065582406 1561260973827508 71254
## - pctBachelors            1   4521012012513 1561339920257614 71254
## - pctHis                  1   5236458664406 1562055366909508 71255
## - pctWhite                1   5696837558700 1562515745803801 71256
## - Bath                    1   8710284937892 1565529193182993 71261
## - pctBlack                1   8711864841164 1565530773086265 71261
## - pctCarCommute           1  15910805526417 1572729713771518 71273
## - MedHHInc                1  19460549195774 1576279457440874 71279
## - pctHH200kOrMore         1  20568246267674 1577387154512774 71281
## - Bed                     1  22424636843521 1579243545088622 71284
## - singlefamily            1  32126353187000 1588945261432101 71300
## - MedRent                 1  52451061395588 1609269969640690 71333
## - elevator                1  54957559139929 1611776467385030 71337
## - LotSize                 1  65109953684523 1621928861929624 71354
## - dock                    1  68890017943056 1625708926188157 71360
## - estates                 1 160174267546179 1716993175791280 71503
## - luxury                  1 312311046910239 1869129955155340 71726
## - ActualSqFt              1 421525593454794 1978344501699895 71875
## 
## Step:  AIC=71246.43
## SalePrice ~ LotSize + Bed + Bath + Age + ActualSqFt + parks_nn1 + 
##     MedRent + MedHHInc + pctWhite + pctBlack + pctHis + pctBachelors + 
##     pool + singlefamily + pctOwnerHH + crime_nn5 + pctNoMortgage + 
##     pctOwnerSince2010 + pctCarCommute + pctPubCommute + pctPoverty + 
##     metro_nn1 + sexualoffenders_Buffer + CoastDist + pctHH200kOrMore + 
##     luxury + elevator + dock + estates
## 
##                          Df       Sum of Sq              RSS   AIC
## - pctOwnerSince2010       1    378764955540 1557369111775838 71245
## - pctNoMortgage           1    485598132482 1557475944952780 71245
## - pctPoverty              1    885909887388 1557876256707686 71246
## - CoastDist               1    969031534572 1557959378354869 71246
## <none>                                      1556990346820297 71246
## - sexualoffenders_Buffer  1   1424094846201 1558414441666498 71247
## - crime_nn5               1   1462837659448 1558453184479746 71247
## - pctPubCommute           1   1726608535822 1558716955356119 71247
## - metro_nn1               1   3242295720872 1560232642541170 71250
## - pool                    1   3943045137196 1560933391957493 71251
## - Age                     1   4262142912155 1561252489732452 71252
## - pctBachelors            1   4444216874477 1561434563694774 71252
## - pctOwnerHH              1   4517585713894 1561507932534191 71252
## - parks_nn1               1   4584285963053 1561574632783350 71252
## - pctHis                  1   5066126328786 1562056473149083 71253
## - pctWhite                1   5525484579872 1562515831400169 71254
## - pctBlack                1   8633531626036 1565623878446333 71259
## - Bath                    1   8720878730640 1565711225550937 71259
## - pctCarCommute           1  15744932373084 1572735279193382 71271
## - MedHHInc                1  19468953809268 1576459300629566 71277
## - pctHH200kOrMore         1  20423674200627 1577414021020924 71279
## - Bed                     1  22335483388045 1579325830208342 71282
## - singlefamily            1  32423848819469 1589414195639766 71299
## - MedRent                 1  52321860769394 1609312207589692 71331
## - elevator                1  54920309218976 1611910656039274 71335
## - LotSize                 1  64942650068849 1621932996889146 71352
## - dock                    1  69069471452323 1626059818272620 71358
## - estates                 1 160502664435973 1717493011256270 71502
## - luxury                  1 312574197736371 1869564544556668 71725
## - ActualSqFt              1 421468947430111 1978459294250408 71874
## 
## Step:  AIC=71245.07
## SalePrice ~ LotSize + Bed + Bath + Age + ActualSqFt + parks_nn1 + 
##     MedRent + MedHHInc + pctWhite + pctBlack + pctHis + pctBachelors + 
##     pool + singlefamily + pctOwnerHH + crime_nn5 + pctNoMortgage + 
##     pctCarCommute + pctPubCommute + pctPoverty + metro_nn1 + 
##     sexualoffenders_Buffer + CoastDist + pctHH200kOrMore + luxury + 
##     elevator + dock + estates
## 
##                          Df       Sum of Sq              RSS   AIC
## - pctPoverty              1    711226699202 1558080338475040 71244
## - CoastDist               1    801404496108 1558170516271946 71244
## - pctNoMortgage           1    840615219640 1558209726995478 71244
## <none>                                      1557369111775838 71245
## - sexualoffenders_Buffer  1   1395672314732 1558764784090569 71245
## - pctPubCommute           1   1448349259902 1558817461035739 71246
## - crime_nn5               1   1659792957243 1559028904733081 71246
## - metro_nn1               1   2925406784097 1560294518559934 71248
## - pool                    1   4006745954864 1561375857730702 71250
## - pctBachelors            1   4065452999328 1561434564775166 71250
## - pctOwnerHH              1   4223036412996 1561592148188833 71250
## - Age                     1   4305984334683 1561675096110520 71250
## - parks_nn1               1   4615673217101 1561984784992938 71251
## - pctHis                  1   4722748347022 1562091860122860 71251
## - pctWhite                1   5231407232715 1562600519008552 71252
## - pctBlack                1   8254834884323 1565623946660160 71257
## - Bath                    1   8774742378387 1566143854154224 71258
## - pctCarCommute           1  15381902470403 1572751014246241 71269
## - MedHHInc                1  19204811757066 1576573923532904 71275
## - pctHH200kOrMore         1  21537953892198 1578907065668036 71279
## - Bed                     1  22538786390294 1579907898166131 71281
## - singlefamily            1  32047037325886 1589416149101724 71297
## - MedRent                 1  52541564862424 1609910676638261 71330
## - elevator                1  55226329020642 1612595440796479 71335
## - LotSize                 1  64911611882944 1622280723658781 71350
## - dock                    1  69438058459428 1626807170235265 71358
## - estates                 1 160191885102324 1717560996878162 71500
## - luxury                  1 312651640114108 1870020751889945 71724
## - ActualSqFt              1 421944482172130 1979313593947967 71873
## 
## Step:  AIC=71244.27
## SalePrice ~ LotSize + Bed + Bath + Age + ActualSqFt + parks_nn1 + 
##     MedRent + MedHHInc + pctWhite + pctBlack + pctHis + pctBachelors + 
##     pool + singlefamily + pctOwnerHH + crime_nn5 + pctNoMortgage + 
##     pctCarCommute + pctPubCommute + metro_nn1 + sexualoffenders_Buffer + 
##     CoastDist + pctHH200kOrMore + luxury + elevator + dock + 
##     estates
## 
##                          Df       Sum of Sq              RSS   AIC
## - CoastDist               1    537174358025 1558617512833065 71243
## - pctNoMortgage           1    672704187898 1558753042662938 71243
## <none>                                      1558080338475040 71244
## - sexualoffenders_Buffer  1   1189166804381 1559269505279421 71244
## - pctPubCommute           1   1198337218813 1559278675693853 71244
## - crime_nn5               1   1637613748592 1559717952223632 71245
## - metro_nn1               1   2561379485218 1560641717960258 71247
## - pctOwnerHH              1   3840556113843 1561920894588883 71249
## - pool                    1   3858128161825 1561938466636865 71249
## - Age                     1   4255023330357 1562335361805397 71249
## - pctHis                  1   4720073435264 1562800411910304 71250
## - pctBachelors            1   4843767871621 1562924106346661 71250
## - parks_nn1               1   5139305192697 1563219643667737 71251
## - pctWhite                1   5238910278604 1563319248753644 71251
## - pctBlack                1   8608217010056 1566688555485096 71257
## - Bath                    1   8842567737520 1566922906212560 71257
## - pctCarCommute           1  15688324178424 1573768662653464 71269
## - MedHHInc                1  20098684779764 1578179023254804 71276
## - pctHH200kOrMore         1  22510542707686 1580590881182726 71280
## - Bed                     1  22821004167456 1580901342642496 71280
## - singlefamily            1  32672633203812 1590752971678852 71297
## - MedRent                 1  52053460084419 1610133798559459 71329
## - elevator                1  54955734053303 1613036072528343 71333
## - LotSize                 1  65281712162320 1623362050637360 71350
## - dock                    1  69007494420740 1627087832895780 71356
## - estates                 1 159867325570649 1717947664045689 71499
## - luxury                  1 312312089709836 1870392428184876 71722
## - ActualSqFt              1 423396975538696 1981477314013736 71874
## 
## Step:  AIC=71243.18
## SalePrice ~ LotSize + Bed + Bath + Age + ActualSqFt + parks_nn1 + 
##     MedRent + MedHHInc + pctWhite + pctBlack + pctHis + pctBachelors + 
##     pool + singlefamily + pctOwnerHH + crime_nn5 + pctNoMortgage + 
##     pctCarCommute + pctPubCommute + metro_nn1 + sexualoffenders_Buffer + 
##     pctHH200kOrMore + luxury + elevator + dock + estates
## 
##                          Df       Sum of Sq              RSS   AIC
## - pctNoMortgage           1    831913368312 1559449426201377 71243
## - pctPubCommute           1   1166005710260 1559783518543325 71243
## <none>                                      1558617512833065 71243
## - sexualoffenders_Buffer  1   1345449996428 1559962962829494 71243
## - crime_nn5               1   1686724412312 1560304237245377 71244
## - metro_nn1               1   2028394469436 1560645907302502 71245
## - pctOwnerHH              1   3304221325672 1561921734158737 71247
## - pool                    1   3800932506688 1562418445339753 71248
## - Age                     1   4194005638651 1562811518471716 71248
## - pctHis                  1   4223563088938 1562841075922003 71248
## - pctWhite                1   4722087591294 1563339600424360 71249
## - parks_nn1               1   5012769143588 1563630281976654 71250
## - pctBachelors            1   5303478886028 1563920991719093 71250
## - pctBlack                1   8680645196341 1567298158029406 71256
## - Bath                    1   8982099729198 1567599612562264 71256
## - pctCarCommute           1  15249361774748 1573866874607814 71267
## - MedHHInc                1  19675114444262 1578292627277328 71274
## - pctHH200kOrMore         1  21973399333651 1580590912166716 71278
## - Bed                     1  23190533180545 1581808046013610 71280
## - singlefamily            1  32156628247600 1590774141080665 71295
## - MedRent                 1  51706646403425 1610324159236490 71327
## - elevator                1  54755814703600 1613373327536665 71332
## - LotSize                 1  65499920798491 1624117433631556 71349
## - dock                    1  69516069599367 1628133582432432 71356
## - estates                 1 159433161957169 1718050674790234 71497
## - luxury                  1 312035675887201 1870653188720266 71720
## - ActualSqFt              1 425764199164783 1984381711997848 71875
## 
## Step:  AIC=71242.58
## SalePrice ~ LotSize + Bed + Bath + Age + ActualSqFt + parks_nn1 + 
##     MedRent + MedHHInc + pctWhite + pctBlack + pctHis + pctBachelors + 
##     pool + singlefamily + pctOwnerHH + crime_nn5 + pctCarCommute + 
##     pctPubCommute + metro_nn1 + sexualoffenders_Buffer + pctHH200kOrMore + 
##     luxury + elevator + dock + estates
## 
##                          Df       Sum of Sq              RSS   AIC
## <none>                                      1559449426201376 71243
## - sexualoffenders_Buffer  1   1253798137015 1560703224338392 71243
## - pctPubCommute           1   1365841858245 1560815268059622 71243
## - crime_nn5               1   1973636174701 1561423062376078 71244
## - metro_nn1               1   1979348929748 1561428775131124 71244
## - pctHis                  1   3575637622773 1563025063824150 71247
## - pool                    1   3711001497375 1563160427698752 71247
## - Age                     1   4337709941065 1563787136142441 71248
## - pctWhite                1   4780517806188 1564229944007564 71249
## - parks_nn1               1   5424732183716 1564874158385092 71250
## - pctBachelors            1   7052364537540 1566501790738916 71252
## - pctBlack                1   8030104388766 1567479530590143 71254
## - pctOwnerHH              1   8236182872165 1567685609073542 71254
## - Bath                    1   9177413308810 1568626839510186 71256
## - pctCarCommute           1  15864869875371 1575314296076748 71267
## - pctHH200kOrMore         1  22449536070059 1581898962271436 71278
## - Bed                     1  23546167586048 1582995593787425 71280
## - MedHHInc                1  23673147512708 1583122573714084 71280
## - singlefamily            1  31324945385453 1590774371586830 71293
## - MedRent                 1  50970218792512 1610419644993889 71325
## - elevator                1  54952098698386 1614401524899763 71332
## - LotSize                 1  66453917166717 1625903343368093 71350
## - dock                    1  69300151765953 1628749577967330 71355
## - estates                 1 159019305847057 1718468732048433 71496
## - luxury                  1 311803649136820 1871253075338196 71719
## - ActualSqFt              1 425711244001338 1985160670202715 71874
## 
## Call:
## lm(formula = SalePrice ~ LotSize + Bed + Bath + Age + ActualSqFt + 
##     parks_nn1 + MedRent + MedHHInc + pctWhite + pctBlack + pctHis + 
##     pctBachelors + pool + singlefamily + pctOwnerHH + crime_nn5 + 
##     pctCarCommute + pctPubCommute + metro_nn1 + sexualoffenders_Buffer + 
##     pctHH200kOrMore + luxury + elevator + dock + estates, data = st_drop_geometry(MiamiProperties) %>% 
##     dplyr::select(SalePrice, LotSize, Bed, Bath, Age, ActualSqFt, 
##         parks_nn1, MedRent, MedHHInc, pctWhite, pctBlack, pctHis, 
##         pctBlackorHis, pctBachelors, pool, singlefamily, pctOwnerHH, 
##         pctRenterHH, crime_nn5, milecoast, pctNoMortgage, pctOwnerSince2010, 
##         pctCarCommute, pctPubCommute, pctPoverty, Halfmile_metro, 
##         metro_nn1, sexualoffenders_Buffer, milecoast, CoastDist, 
##         pctHH200kOrMore, luxury, elevator, dock, estates))
## 
## Coefficients:
##            (Intercept)                 LotSize                     Bed  
##             276011.040                  53.187             -160779.522  
##                   Bath                     Age              ActualSqFt  
##             105986.983               -1732.149                 569.133  
##              parks_nn1                 MedRent                MedHHInc  
##                 61.026                 433.723                  -7.685  
##               pctWhite                pctBlack                  pctHis  
##               8704.458               16943.994                6924.188  
##           pctBachelors                poolPool         singlefamilyYes  
##             -99693.043             -123151.714             -289647.271  
##             pctOwnerHH               crime_nn5           pctCarCommute  
##               5972.884                 -25.664              -19528.543  
##          pctPubCommute               metro_nn1  sexualoffenders_Buffer  
##              -8940.123                   5.056              -31486.780  
##        pctHH200kOrMore               luxuryYes             elevatorYes  
##              75995.341             3887086.047             1654400.949  
##            dockNo Dock              estatesYes  
##            -978048.563             3274924.331

The variables with crime factors were not part of the output from the stepwise backward function. However, we made a judgment call that crime (proximity or density of sexual offenders as a proxy) is important factor in the predicting the price of a home. So we decided to try adding “crime_nn5”, which indicates the average distances of the 5 nearest sexual offenders, and “sexualoffenders_Buffer”, which indicates the number of sexual offenders in a half mile buffer of a home.

>>>>>>> 30a48337056764284380a80a0436d5403662727e
Dependent variable:
SalePrice
(1) (2) (3) (4) (5) (6)
LotSize 48.010*** 82.986*** 69.292*** 53.669*** 53.740*** 47.937***
(5.175) (5.127) (5.599) (5.054) (5.065) (5.112)
Bed -177,431.500*** -200,237.200*** -311,230.100*** -160,559.900*** -155,013.500*** -162,279.300***
(26,364.890) (27,312.320) (29,612.850) (25,680.120) (25,599.310) (26,045.800)
Bath 130,667.000*** 121,706.900*** 149,004.000*** 106,146.400*** 104,343.500*** 105,904.800***
(27,776.340) (28,997.630) (31,812.030) (27,098.590) (27,110.510) (27,458.130)
Age -1,692.840** -1,544.741** -3,078.212*** -1,675.232*** -1,706.347*** -1,724.194***
(663.551) (685.881) (757.994) (643.991) (642.513) (655.959)
ActualSqFt 601.234*** 642.300*** 907.582*** 566.981*** 566.367*** 567.197***
(21.615) (22.281) (20.603) (21.352) (21.341) (21.645)
parks_nn1 51.400** 64.165*** 57.585*** 67.923*** 112.126***
(21.364) (24.291) (20.369) (19.549) (19.696)
MedRent 493.947*** 521.641*** 456.687*** 441.525***
(52.471) (57.422) (45.668) (45.275)
MedHHInc -7.827*** -9.410*** -7.539*** -8.645*** -4.584***
(1.451) (1.528) (1.251) (1.294) (1.164)
pctWhite 12,467.680*** 6,967.233**
(4,065.926) (2,886.996)
pctBlack 22,126.360*** 13,438.520***
(6,677.178) (3,927.429)
pctHis 8,266.271** 5,494.729** 605.473
(3,931.581) (2,424.217) (751.950)
pctBlackorHis 1,473.638 4,242.732*
(3,050.742) (2,445.037)
pctBachelors -137,797.300*** -139,740.100*** -92,711.170*** -108,829.400*** -72,509.500***
(34,187.930) (37,923.630) (29,191.060) (28,608.070) (22,955.770)
elevatorYes 1,459,061.000*** 1,976,763.000*** 1,659,684.000*** 1,662,770.000*** 1,581,587.000***
(177,745.200) (184,267.700) (172,869.400) (173,020.000) (174,485.600)
dockNo Dock -938,449.800*** -1,024,027.000*** -977,745.300*** -929,487.400*** -926,879.800***
(90,502.210) (94,386.810) (89,406.460) (88,570.450) (89,229.720)
BeachViewYes -87,089.150 196,961.700 244,590.800
(151,595.400) (172,970.800) (161,821.600)
poolPool 1,660.539 -279,862.100*** -305,848.900*** -121,022.900** -132,705.700*** -115,987.400**
(47,763.740) (52,041.330) (57,455.040) (49,349.150) (49,158.020) (49,838.530)
pctOwnerHH 4,370.392* 9,791.028*** 5,523.733*** 7,565.442*** 6,574.484***
(2,420.360) (2,839.093) (1,850.178) (1,407.104) (1,345.469)
pctRenterHH
crime_nn5 -31.101** 3.317 -14.777
(14.112) (16.844) (12.749)
milecoastYes -45,641.080 -61,817.680
(64,424.120) (73,425.550)
pctNoMortgage -2,867.908 2,464.264 11,267.500
(9,708.868) (10,851.580) (8,546.168)
pctOwnerSince2010 -5,777.683 4,609.115
(4,458.552) (5,023.466)
pctCarCommute -23,557.440*** -23,196.010*** -14,759.380*** -14,440.820*** -14,239.530***
(3,998.039) (4,590.258) (2,845.830) (2,717.118) (2,390.630)
pctPubCommute -11,823.920* -7,224.647
(6,115.165) (7,335.817)
pctPoverty 6,090.940** 157.046
(2,810.809) (3,425.889)
Halfmile_metroNon_Halfmile_metro -42,914.660 -78,563.660 22,804.340 56,329.800
(55,350.630) (66,698.300) (51,315.830) (51,761.810)
metro_nn1 11.786***
(4.101)
sexualoffenders_Buffer -20,107.980 -24,402.190
(25,869.850) (21,208.070)
CoastDist <<<<<<< HEAD 8,803.452 ======= 8,803.453 >>>>>>> 30a48337056764284380a80a0436d5403662727e -28,645.250
(25,191.570) (32,654.660)
pctHH200kOrMore 54,814.180*** 59,847.630*** 68,705.370*** 73,923.860*** 43,001.600***
(13,328.200) (15,142.700) (11,418.110) (12,070.070) (9,838.541)
luxuryYes 3,943,744.000*** 3,914,686.000*** 3,884,152.000*** 3,899,354.000*** 3,956,162.000***
(176,617.800) (181,844.500) (170,471.500) (170,702.800) (172,911.900)
singlefamilyYes -243,110.700*** -548,427.800*** -304,755.200*** -293,070.000*** -322,687.400***
(34,594.600) (47,211.070) (39,753.950) (38,447.110) (39,797.510)
estatesYes 3,432,949.000*** 3,266,344.000*** 3,324,030.000*** 3,465,026.000***
(203,207.400) (200,976.900) (201,159.200) (202,098.500)
Constant 504,740.800*** 1,732,718.000*** -1,199,084.000** -2,734.609 742,843.400*** 1,353,779.000***
(112,369.400) (404,062.900) (598,149.600) (331,626.900) (256,129.200) (204,321.500)
Observations 2,627 2,626 2,626 2,626 2,626 2,627
R2 0.845 0.837 0.803 0.857 0.857 0.852
Adjusted R2 0.844 0.835 0.801 0.856 0.856 0.851
Residual Std. Error 808,061.300 (df = 2614) 829,004.200 (df = 2599) 910,113.100 (df = 2598) 774,919.900 (df = 2603) 775,610.700 (df = 2603) 789,882.400 (df = 2606)
F Statistic 1,187.840*** (df = 12; 2614) 512.839*** (df = 26; 2599) 393.389*** (df = 27; 2598) 710.520*** (df = 22; 2603) 709.044*** (df = 22; 2603) 752.372*** (df = 20; 2606)
Note: p<0.1; p<0.05; p<0.01
<<<<<<< HEAD

Comparing the models we have created in the table above, we have decided that reg6, despite having a marginally smaller p-value than reg4, has the variables that we believe are critical for price prediction.

=======

Comparing the models we have created in the table above, we have decided that reg6, despite having a marginally smaller p-value, has the variables that we believe are critical for price prediction.

>>>>>>> 30a48337056764284380a80a0436d5403662727e

Final Model

Fitting linear model: SalePrice ~ . (continued below)
  Estimate Std. Error t value
(Intercept) 1353779 204321 6.626
LotSize 47.94 5.112 9.377
Bed -162279 26046 -6.231
Bath 105905 27458 3.857
Age -1724 656 -2.629
ActualSqFt 567.2 21.64 26.2
parks_nn1 112.1 19.7 5.693
MedHHInc -4.584 1.164 -3.94
pctHis 605.5 752 0.8052
pctBachelors -72510 22956 -3.159
poolPool -115987 49839 -2.327
singlefamilyYes -322687 39798 -8.108
pctOwnerHH 6574 1345 4.886
pctCarCommute -14240 2391 -5.956
pctHH200kOrMore 43002 9839 4.371
Halfmile_metroNon_Halfmile_metro 56330 51762 1.088
luxuryYes 3956162 172912 22.88
elevatorYes 1581587 174486 9.064
dockNo Dock -926880 89230 -10.39
sexualoffenders_Buffer -24402 21208 -1.151
estatesYes 3465026 202099 17.15
<<<<<<< HEAD =======
>>>>>>> 30a48337056764284380a80a0436d5403662727e <<<<<<< HEAD ======= >>>>>>> 30a48337056764284380a80a0436d5403662727e <<<<<<< HEAD ======= >>>>>>> 30a48337056764284380a80a0436d5403662727e <<<<<<< HEAD ======= >>>>>>> 30a48337056764284380a80a0436d5403662727e <<<<<<< HEAD ======= >>>>>>> 30a48337056764284380a80a0436d5403662727e <<<<<<< HEAD
  Pr(>|t|)
(Intercept)4.183e-11
LotSize 1.428e-20
Bed 5.403e-100.00000000004183
LotSize 0.00000000000000000001428
Bed 0.0000000005403
Bath 0.0001176
Age 0.008626
ActualSqFt1.516e-134
parks_nn1 1.39e-08
MedHHInc 8.365e-050.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001516
parks_nn1 0.0000000139
MedHHInc 0.00008365
pctHis 0.4208
pctBachelors 0.001603
poolPool 0.02003
singlefamilyYes7.818e-16
pctOwnerHH 1.089e-06
pctCarCommute 2.926e-09
pctHH200kOrMore 1.287e-050.0000000000000007818
pctOwnerHH 0.000001089
pctCarCommute 0.000000002926
pctHH200kOrMore 0.00001287
Halfmile_metroNon_Halfmile_metro 0.2766
luxuryYes9.911e-106
elevatorYes 2.402e-19
dockNo Dock 8.547e-250.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000009911
elevatorYes 0.0000000000000000002402
dockNo Dock 0.0000000000000000000000008547
sexualoffenders_Buffer 0.25
estatesYes1.613e-62

4. Results

To test our selected model, we randomly divided our master dataset, MiamiProperties, into separate training (60%) and test (40%) sets. The table shows the results of model.

# set random seed
set.seed(31337)

# get index for training sample
inTrain <- caret::createDataPartition(
  y = paste(MiamiProperties$Neighborhood),
  p = .60, list = FALSE)
# split data into training and test
Miami.training <- MiamiProperties[inTrain,] 
Miami.test     <- MiamiProperties[-inTrain,]  

#Regression
reg_final_train <- lm(SalePrice ~ ., data = st_drop_geometry(Miami.training) %>% 
                  dplyr::select(SalePrice, LotSize, Bed, Bath, Age, ActualSqFt, parks_nn1, MedHHInc, pctHis, pctBachelors, pool, 
                                singlefamily, pctOwnerHH, pctCarCommute, pctHH200kOrMore, Halfmile_metro, luxury, elevator, dock, sexualoffenders_Buffer, estates))


stargazer(reg_final_train, type = "html", main="Training Set Model Summary Results", caption="Model Summary Results for Training Dataset")
Dependent variable:
SalePrice
LotSize 28.105***
(6.645)
Bed -190,599.700***
(32,140.920)
Bath 35,110.370
(34,273.190)
Age -1,976.730**
(810.014)
ActualSqFt 657.844***
(25.999)
parks_nn1 131.932***
(24.500)
MedHHInc -4.868***
(1.460)
pctHis 329.402
(932.747)
pctBachelors -39,660.310
(27,994.360)
poolPool -98,341.250
(63,949.590)
singlefamilyYes -335,280.300***
(49,361.380)
pctOwnerHH 8,522.047***
(1,664.231)
pctCarCommute -19,406.760***
(2,940.341)
pctHH200kOrMore 37,930.350***
(12,245.920)
Halfmile_metroNon_Halfmile_metro 88,067.790
(63,993.000)
luxuryYes 3,616,042.000***
(196,018.400)
elevatorYes 1,154,349.000***
(204,790.700)
dockNo Dock -919,700.000***
(115,585.000)
sexualoffenders_Buffer -26,241.090
(26,070.220)
estatesYes 3,078,753.000***
(239,871.600)
Constant 1,818,951.000***
(251,193.600)
Observations 1,608
R2 0.868
Adjusted R2 0.866
Residual Std. Error 766,335.500 (df = 1587)
F Statistic 521.846*** (df = 20; 1587)
Note: p<0.1; p<0.05; p<0.01
Training Set Model Summary Results
Model Summary Results for Training Dataset

As an indicator of how well our model is performing, the values below show the mean average error of our model on the training and test set. We do think these error values are high. However, as discussed below, our model has issues predicting higher priced homes, which skews our mean average error. Therefore, we don’t think the mean average error is the best single indicator to validate the fit of our model.

## predicting on new data
reg_final_predict <- predict(reg_final_train, newdata = Miami.test)

## Mean Square Error train and test
rmse.train <- caret::MAE(predict(reg_final_train), Miami.training$SalePrice)
rmse.test  <- caret::MAE(reg_final_predict, Miami.test$SalePrice)

cat("Train MAE: ", as.integer(rmse.train), " \n","Test MAE: ", as.integer(rmse.test))
## Train MAE:  412089  
##  Test MAE:  372278

The first plot below shows predicted home prices as a function of observed home prices in Miami for all homes. The bottom figure shows predicted home prices as a function of observed home prices in Miami divided between the test set and the training set. The orange line represents a scenario where the predicted home price perfectly matches the observed home price. The green line represents our model’s prediction results. As the plots show, our model is systematically under predicting home prices (e.g., a home that is actually priced at $1 million is predicted to be approximately 600,000 dollars. The green line stays close to the perfect prediction line (orange) until homes with actual prices around 1 million dollars. At higher prices, our model performs worse.

#Plotting accuracy metrics
preds.train <- data.frame(pred   = predict(reg_final_train),
                          actual = Miami.training$SalePrice,
                          source = "training data")
preds.test  <- data.frame(pred   = reg_final_predict,
                          actual = Miami.test$SalePrice,
                          source = "testing data")
preds <- rbind(preds.train, preds.test)

grid.arrange(nrow = 2,
ggplot(preds, aes(x = actual, y = pred)) +
  geom_point() +
  geom_smooth(method = "lm", color = "green") +
  geom_abline(color = "orange") +
  coord_equal() +
  theme_bw() +
  labs(title = "Comparing predictions to actual values for all homes",
       caption="Predicted Home Prices as a Function of Observed Home Prices",
       x = "Actual Value",
       y = "Predicted Value") +
  theme(),
ggplot(preds, aes(x = actual, y = pred, color = source)) +
  geom_point() +
  geom_smooth(method = "lm", color = "green") +
  geom_abline(color = "orange") +
  coord_equal() +
  theme_bw() +
  facet_wrap(~source, ncol = 2) +
  labs(title = "Comparing predictions to actual values for the test set and the training set",
       caption="Predicted Home Prices as a Function of Observed Home Prices for the Test Set and the Training Set",
       x = "Actual Value",
       y = "Predicted Value") +
  theme(
    legend.position = "none"))

The table below summarizes the mean and standard deviation MAE for our model. In the calculation of RMSE, errors are squared before they are averaged, which gives a relatively higher weight to larger errors. Because our model predicts poorly for highly priced houses, we are not surprised to see RMSE values that are greater than MAE values. In this case, MAE is a better metric to measure the accuracy of our model.

Mean and Standard Deviation for the Mode
RMSE Rsquared MAE RMSESD RsquaredSD MAESD
705970.4 0.8365392 395359.3 445205.2 0.1117048 144068.9

The graphs below shows the results from the 100 folds. THe statistics cluster together which indicates that the model would be generalizable to new data. The relatively normal histogram also confirms that our model would generalize to new data with some errors.

Overall, the mean absolute error for the test set is around 400,000 dollars and the mean absolute percentage error is over 100% (see below table). This indicates that our model does not perform well overall.

Mean Absolute Error and Mean Absolute Percentage Error for the Test Set
Mean_Absolute_Error Mean_Absolute_Percentage_Error
402650 1.433111

The figure below shows a map of residuals for test set. Errors are both positive and negative, indicating that our model is both over and under predicting.

The figures below shows the plot of spatial lag in errors for the test set. Spatial lag tells the average sale price of the 20 nearest neighbors. The plot of spatial lag of error tells us that as home price error increases, the errors or nearby home prices also increase. This correlation tells us that there is a spatial relationship that is unaccounted for in the model. Similarly, the Moran’s I plot provides a secondary visual confirmation that spatial autocorrelation remains.

The table below shows the results of our attempt to account for spatial autocorrelation using a neighborhood feature. However, adding in the neighborhood effect did not greatly improve the results, likely due to a few very expensive homes that our model poorly predicts.

Regression SalePrice.AbsError SalePrice.APE
Baseline Regression 402650.0 1.433111
Neighborhood Effects 401716.9 1.018834

The map shows predicted sale prices for all homes. For the most part, the model predicts the most expensive home prices in Miami Beach and along the coast of the mainland.

secret_data <- MiamiPropertiesPred
secret_preds <- predict(reg6, newdata = secret_data)
output_preds <- data.frame(prediction = secret_preds, Folio = secret_data$Folio, team_name = "Miami Heat Predictors")
#write.csv(output_preds, "Miami Heat Predictors.csv")

known_preds <- predict(reg6, newdata = MiamiProperties)
known_preds_geom<-st_sf(prediction = known_preds, Folio = MiamiProperties$Folio, geometry= MiamiProperties$geometry)

secret_preds_geom <- st_sf(prediction = secret_preds, Folio = secret_data$Folio, geometry= secret_data$geometry)

allpreds<-rbind(known_preds_geom, secret_preds_geom)

ggplot()+
  geom_sf(data = all_nhoods_MB, fill = "grey40") +
  geom_sf(data = allpreds, aes(colour = q5(prediction)),
          show.legend = "point", size = 1) +
  scale_colour_manual(values = palette5,
                      labels=qBr(allpreds,"prediction"),
                      name="Quintile\nBreaks") +
  labs(title="Predicted Home Prices",
       caption="Predicted Home Prices for All Homes") +
  mapTheme()

The figure below shows the mean absolute percentage error by neighborhood. This map shoes that some neighborhoods have a higher mean absolute percentage error (depicted in orange) than the majority of the neighborhoods (in green).

#map MAPE by neighborhoods
st_drop_geometry(bothRegressions) %>%
  group_by(Regression, Neighborhood) %>%
  summarize(mean.MAPE = mean(SalePrice.APE, na.rm = T)) %>%
  ungroup() %>% 
  left_join(all_nhoods) %>%
  st_sf() %>%
  ggplot() + 
  geom_sf(aes(fill = mean.MAPE)) +
  geom_sf(data = bothRegressions, colour = "black", size = .5) +
  facet_wrap(~Regression) +
  scale_fill_gradient(low = palette5[1], high = palette5[5],
                      name = "MAPE") +
  labs(title = "Mean test set MAPE by neighborhood",
       caption="Mean Absolute Percentage Error by Neighborhood for both the Baseline Regression and Neighborhood Effects") +
  mapTheme()

As seen in the figure below, there are two neighborhoods (Coral Gate and Flora Park) where the mean absolute percentage error is very high when the mean price is very low. In addition, there are two neighborhoods (Belle Island and San Marco Island) where the mean price is very high while the mean absolute percentage error is very low. For most neighborhoods, the mean price is less than 3,000,000 and the mean absolute percentage error is less than ~4. Overall, the mean absolute percentage errors are higher than is desired.

#Plot MAPE by neighborhoods
st_drop_geometry(Miami.test) %>%
  group_by(Neighborhood) %>%
  summarize(mean.MAPE = mean(SalePrice.APE, na.rm = T),
            mean.Price=mean(SalePrice, na.rm=T)) %>%
  ggplot(aes(mean.Price, mean.MAPE)) +
  geom_point(color="blue")+
  geom_text(aes(label=ifelse(mean.MAPE>5, Neighborhood,"")), hjust=0.2, vjust=-0.3)+
  geom_text(aes(label=ifelse(mean.Price>3000000, Neighborhood, "")),hjust=0.5, vjust=0)+
  labs(title = "Mean Absolute Percentage Error as a Function of Mean Price by Neighborhood",
       caption="Mean Absolute Percentage Error and Mean Price by Neighborhood") +
  plotTheme()

Generalizability by Race and Income

As shown in the tables below, this model is not generalizable by race or income. These tables provide the Mean Absolute Percent Error for different income groups and homes located in majority Hispanic census tracts. For homes in majority Hispanic census tracts, the model performs better for majority non-hispanic without neighborhood effects. With respect to income groups, our model performs better for homes located in high income census tracts than for homes located in low income tracts.

#Race Context and Income Context (from book)
tracts18 <- 
  tracts18%>%
  mutate(raceContext = ifelse(pctHis > 50, "Majority Hispanic", "Majority Non-Hispanic"),
         incomeContext = ifelse(MedInc > 32322, "High Income", "Low Income"))

grid.arrange(ncol = 2,
             ggplot() + geom_sf(data = na.omit(tracts18), aes(fill = raceContext)) +
               scale_fill_manual(values = c("#25CB10", "#FA7800"), name="Race Context") +
               labs(title = "Race Context") +
               mapTheme() + theme(legend.position="bottom"), 
             ggplot() + geom_sf(data = na.omit(tracts18), aes(fill = incomeContext)) +
               scale_fill_manual(values = c("#25CB10", "#FA7800"), name="Income Context") +
               labs(title = "Income Context") +
               mapTheme() + 
               theme(legend.position="bottom"))

st_join(st_centroid(bothRegressions), tracts18) %>% 
  group_by(Regression, raceContext) %>%
  summarize(mean.MAPE = scales::percent(mean(SalePrice.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(raceContext, mean.MAPE) %>%
  kable(caption = "Table X. Test set MAPE by neighborhood racial context") %>%
  kable_styling("striped", full_width = F)
Table X. Test set MAPE by neighborhood racial context
Regression Majority Hispanic Majority Non-Hispanic
Baseline Regression 178% 110%
Neighborhood Effects 84% 119%
st_join(st_centroid(bothRegressions), tracts18) %>% 
  filter(!is.na(incomeContext)) %>%
  group_by(Regression, incomeContext) %>%
  summarize(mean.MAPE = scales::percent(mean(SalePrice.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(incomeContext, mean.MAPE) %>%
  kable(caption = "Table XI. Test set MAPE by neighborhood income context") %>%
  kable_styling("striped", full_width = F)
======= >>>>>>> 30a48337056764284380a80a0436d5403662727e
Table XI. Test set MAPE by neighborhood income context
Regression High Income Low Income
Baseline Regression 44% 180%
Neighborhood Effects 35% 126% 0.00000000000000000000000000000000000000000000000000000000000001613
<<<<<<< HEAD

5. Discussion

Overall, the indicators used to evaluate the fit of the model (e.g., MAE) show that our model is not predicting well and is therefore not effective for use. However, additional evaluations show that the model performs fairly well for all homes less than 10 million dollars. We were unable to effectively engineer features to correctly model prices for very expensive homes. That being said, some of the more interesting variables included in the model were percent of the population with a bachelors degree, presence of an elevator or a dock, and whether the home is single family or multi-family.

The model was able to predict 85% of the variation in price, and most important features were internal property features, including lot size, actual square footage, whether the home is zoned for single family, and whether the home had an elevator or a dock. Generally, we think that our model’s inability to predict expensive homes is the largest source of the error. This poor performance skews our overall MAE. According to the mapped results, we were unable to effectively account for the spatial variation in prices. The model predicted poorly in Miami Beach and along the mainland coast, which is where many of the expensive houses are located.

6. Conclusion

We would recommend that Zillow make further improvements before employing the model for official use. The model has large errors and does not do a good job of predicting home prices. In order to improve this model, we suggest further exploring features that would better account for the spatial processes at play. The goal of adding these features would be to account for spatial autocorrelation.

=======

4. Results

-Split the ‘toPredict’ == 0 into a separate training and test set. -Provide a polished table of your (training set) lm summary results (coefficients, R2 etc). -Provide a polished table of mean absolute error and MAPE for a single test set. Check out the “kable” function for markdown to create nice tables. -Provide the results of your cross-validation tests. This includes mean and standard deviation MAE. Do 100 folds and plot your cross-validation MAE as a histogram. Is your model generalizable to new data? -Plot predicted prices as a function of observed prices -Provide a map of your residuals for your test set. Include a Moran’s I test and a plot of the spatial lag in errors. 5 -Provide a map of your predicted values for where ‘toPredict’ is both 0 and 1. -Using the test set predictions, provide a map of mean absolute percentage error (MAPE) by neighborhood. -Provide a scatterplot plot of MAPE by neighborhood as a function of mean price by neighborhood. -Using tidycensus, split your city in to two groups (perhaps by race or income) and test your model’s generalizability. Is your model generalizable?

To test our selected model, we randomly divided our master dataset, MiamiProperties, into separate training (60%) and test (40%) sets. Table 2

# set random seed
set.seed(31337)

# get index for training sample
inTrain <- caret::createDataPartition(
  y = paste(MiamiProperties$Neighborhood),
  p = .60, list = FALSE)
# split data into training and test
Miami.training <- MiamiProperties[inTrain,] 
Miami.test     <- MiamiProperties[-inTrain,]  

anyNA(MiamiProperties$estates)

[1] FALSE

#Regression
reg_final_train <- lm(SalePrice ~ ., data = st_drop_geometry(Miami.training) %>% 
                  dplyr::select(SalePrice, LotSize, Bed, Bath, Age, ActualSqFt, parks_nn1, MedHHInc, pctHis, pctBachelors, pool, 
                                singlefamily, pctOwnerHH, pctCarCommute, pctHH200kOrMore, Halfmile_metro, luxury, elevator, dock, sexualoffenders_Buffer, estates))

stargazer(reg_final_train, type = "html", main="Training Set Model Summary Results", caption="Table 2. Model Summary Results for Training Dataset")
Dependent variable:
SalePrice
LotSize 28.105***
(6.645)
Bed -190,599.700***
(32,140.920)
Bath 35,110.370
(34,273.190)
Age -1,976.730**
(810.014)
ActualSqFt 657.844***
(25.999)
parks_nn1 131.932***
(24.500)
MedHHInc -4.868***
(1.460)
pctHis 329.402
(932.747)
pctBachelors -39,660.310
(27,994.360)
poolPool -98,341.250
(63,949.590)
singlefamilyYes -335,280.300***
(49,361.380)
pctOwnerHH 8,522.047***
(1,664.231)
pctCarCommute -19,406.760***
(2,940.341)
pctHH200kOrMore 37,930.350***
(12,245.920)
Halfmile_metroNon_Halfmile_metro 88,067.790
(63,993.000)
luxuryYes 3,616,042.000***
(196,018.400)
elevatorYes 1,154,349.000***
(204,790.700)
dockNo Dock -919,700.000***
(115,585.000)
sexualoffenders_Buffer -26,241.090
(26,070.220)
estatesYes 3,078,753.000***
(239,871.600)
Constant 1,818,951.000***
(251,193.600)
Observations 1,608
R2 0.868
Adjusted R2 0.866
Residual Std. Error 766,335.500 (df = 1587)
F Statistic 521.846*** (df = 20; 1587)
Note: p<0.1; p<0.05; p<0.01
Training Set Model Summary Results
Table 2. Model Summary Results for Training Dataset
## predicting on new data
reg_final_predict <- predict(reg_final_train, newdata = Miami.test)


## Mean Square Error train and test
rmse.train <- caret::MAE(predict(reg_final_train), Miami.training$SalePrice)
rmse.test  <- caret::MAE(reg_final_predict, Miami.test$SalePrice)

cat("Train MAE: ", as.integer(rmse.train), " \n","Test MAE: ", as.integer(rmse.test))
## Train MAE:  412089  
##  Test MAE:  372278
#Plotting accuracy metrics
preds.train <- data.frame(pred   = predict(reg_final_train),
                          actual = Miami.training$SalePrice,
                          source = "training data")
preds.test  <- data.frame(pred   = reg_final_predict,
                          actual = Miami.test$SalePrice,
                          source = "testing data")
preds <- rbind(preds.train, preds.test)

ggplot(preds, aes(x = pred, y = actual, color = source)) +
  geom_point() +
  geom_smooth(method = "lm", color = "green") +
  geom_abline(color = "orange") +
  coord_equal() +
  theme_bw() +
  facet_wrap(~source, ncol = 2) +
  labs(title = "Comparing predictions to actual values",
       x = "Predicted Value",
       y = "Actual Value") +
  theme(
    legend.position = "none"
  )

##### PREDICTION #####
secret_data <- MiamiPropertiesPred
secret_preds <- predict(reg6, newdata = secret_data)
output_preds <- data.frame(prediction = secret_preds, Folio = secret_data$Folio, team_name = "Miami Heat Predictors")
write.csv(output_preds, "Miami Heat Predictors.csv")
########### CROSS- VALIDATION ##########

#Generalizability - cross validation
fitControl <- trainControl(method = "cv", number = 100, savePredictions = TRUE)
set.seed(825)

reg.cv <- 
  train(SalePrice ~ ., data = st_drop_geometry(MiamiProperties) %>% 
          dplyr::select(SalePrice, LotSize, Bed, Bath, Age, ActualSqFt, parks_nn1, MedHHInc, pctHis, pctBachelors, pool, 
                        singlefamily, pctOwnerHH, pctCarCommute, pctHH200kOrMore, Halfmile_metro, luxury, elevator, dock, sexualoffenders_Buffer, estates), 
        method = "lm", trControl = fitControl, na.action = na.pass)

reg.cv
## Linear Regression 
## 
## 2627 samples
##   20 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 2601, 2600, 2602, 2600, 2601, 2601, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   705970.4  0.8365392  395359.3
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
reg.cv$resample
##          RMSE  Rsquared      MAE Resample
## 1    773505.1 0.9765727 430198.9  Fold001
## 2   1336408.5 0.9180646 606279.2  Fold002
## 3    671136.5 0.9310164 393067.8  Fold003
## 4    672522.1 0.7739194 461455.7  Fold004
## 5    689466.6 0.7679178 400598.4  Fold005
## 6    948758.9 0.9745954 402110.2  Fold006
## 7    359388.1 0.6679438 285979.5  Fold007
## 8    312685.9 0.8556234 259063.1  Fold008
## 9    928025.0 0.7386786 474204.7  Fold009
## 10  1898782.1 0.9621317 885762.2  Fold010
## 11   546840.6 0.9500958 377649.4  Fold011
## 12  1079773.8 0.8012237 470954.1  Fold012
## 13   450627.0 0.7802116 291624.2  Fold013
## 14   396249.0 0.8227055 293931.6  Fold014
## 15   493257.0 0.7859733 336333.4  Fold015
## 16   350777.0 0.7351069 217379.3  Fold016
## 17  1126474.4 0.7738343 596149.0  Fold017
## 18  1035253.3 0.7841888 518705.5  Fold018
## 19   374594.2 0.7749765 300680.4  Fold019
## 20   435979.8 0.7241656 337576.9  Fold020
## 21   781308.5 0.8780393 326649.6  Fold021
## 22   623979.9 0.6819755 413006.7  Fold022
## 23  1504376.2 0.4682456 699978.0  Fold023
## 24   458731.0 0.8120372 383649.0  Fold024
## 25   470262.3 0.9876497 325502.6  Fold025
## 26   868910.2 0.8633297 403368.2  Fold026
## 27  1232477.1 0.8114825 485249.3  Fold027
## 28   304735.0 0.8947379 253863.9  Fold028
## 29   393765.0 0.9114925 294384.3  Fold029
## 30   372335.7 0.8670785 274610.8  Fold030
## 31   390110.3 0.8353459 310797.0  Fold031
## 32  1013279.3 0.8741693 514069.9  Fold032
## 33   463564.3 0.8881520 368094.9  Fold033
## 34   567097.3 0.8572772 341291.2  Fold034
## 35   355287.4 0.5290788 266347.2  Fold035
## 36   667973.2 0.7645775 372203.0  Fold036
## 37  2794545.1 0.9073163 838659.8  Fold037
## 38  1168442.7 0.9336798 509841.3  Fold038
## 39   808275.0 0.7841398 436260.4  Fold039
## 40   521672.8 0.7042793 320989.6  Fold040
## 41   595353.3 0.9455842 388787.8  Fold041
## 42  1291256.7 0.9269569 608567.2  Fold042
## 43   311471.9 0.9158634 231478.8  Fold043
## 44   310502.9 0.7882913 239804.7  Fold044
## 45   533492.1 0.8354165 374576.4  Fold045
## 46   558253.3 0.9696126 405957.4  Fold046
## 47   407164.8 0.7050802 317756.0  Fold047
## 48   410460.3 0.9642930 332013.8  Fold048
## 49   887922.1 0.8231950 472054.4  Fold049
## 50   672804.3 0.6192428 353980.6  Fold050
## 51   522826.8 0.7537659 382985.8  Fold051
## 52   473851.1 0.6159535 339209.6  Fold052
## 53   642771.4 0.9557383 453192.7  Fold053
## 54   564022.9 0.8035761 407895.6  Fold054
## 55  1069713.7 0.8456283 604164.9  Fold055
## 56   367785.7 0.9909964 261109.3  Fold056
## 57   479899.9 0.9405142 304032.3  Fold057
## 58   255378.8 0.8288930 196113.9  Fold058
## 59  1044710.7 0.9295741 574472.1  Fold059
## 60   317281.1 0.7540499 247147.4  Fold060
## 61   367356.2 0.8410152 302212.4  Fold061
## 62   709895.2 0.9373799 422890.5  Fold062
## 63  1123032.0 0.8613488 424309.3  Fold063
## 64   734524.8 0.7659228 448788.5  Fold064
## 65   299213.5 0.8635155 237928.8  Fold065
## 66   334058.9 0.7258302 237451.4  Fold066
## 67   989911.1 0.9924853 421428.9  Fold067
## 68   542856.4 0.9900273 336286.1  Fold068
## 69   984634.9 0.9545788 517953.0  Fold069
## 70   809832.1 0.9761899 355268.4  Fold070
## 71   287613.0 0.8020381 230585.3  Fold071
## 72   603176.2 0.9160568 345114.9  Fold072
## 73   404871.7 0.8523867 314251.1  Fold073
## 74   841628.6 0.5399965 401023.9  Fold074
## 75   837971.4 0.9451447 527560.9  Fold075
## 76   838734.0 0.7042508 490796.2  Fold076
## 77   664316.9 0.8791849 335175.8  Fold077
## 78   347920.2 0.9263979 255828.4  Fold078
## 79   388867.1 0.7643258 265600.3  Fold079
## 80   567665.1 0.9610869 418659.4  Fold080
## 81   683677.2 0.9800604 398450.1  Fold081
## 82   736524.5 0.6317219 459411.3  Fold082
## 83   346396.2 0.8541700 269203.6  Fold083
## 84   605961.9 0.8515606 382929.4  Fold084
## 85  1886622.7 0.8638038 925501.6  Fold085
## 86  1051618.3 0.9418650 582945.1  Fold086
## 87   311014.7 0.9724418 263592.2  Fold087
## 88   730867.4 0.7781796 410413.5  Fold088
## 89  2763407.2 0.7963621 890907.4  Fold089
## 90   650293.8 0.8911023 326229.0  Fold090
## 91  1235530.0 0.9189961 537204.1  Fold091
## 92   468250.3 0.9033640 299455.3  Fold092
## 93   751792.9 0.7827611 399809.8  Fold093
## 94   496544.5 0.7432410 297671.9  Fold094
## 95   438598.2 0.8611757 300805.3  Fold095
## 96   374291.7 0.8758935 239814.5  Fold096
## 97   483600.2 0.8948108 342130.8  Fold097
## 98   673718.8 0.8858796 369229.2  Fold098
## 99   490757.6 0.5483328 350915.7  Fold099
## 100  380939.3 0.8137839 226376.2  Fold100
reg.cv$resample %>% 
  pivot_longer(-Resample) %>% 
  mutate(name = as.factor(name)) %>% 
  ggplot(., aes(x = name, y = value, color = name)) +
  geom_jitter(width = 0.1) +
  facet_wrap(~name, ncol = 3, scales = "free") +
  theme_bw() +
  theme(
    legend.position = "none"
  )

# extract predictions from CV object
cv_preds <- reg.cv$pred
nrow(MiamiProperties)
## [1] 2627
nrow(cv_preds)
## [1] 2627
#Create dataset with "out of fold" predictions and original data
map_preds <- MiamiProperties %>% 
  rowid_to_column(var = "rowIndex") %>% 
  left_join(cv_preds, by = "rowIndex") %>% 
  mutate(SalePrice.AbsError = abs(pred - SalePrice)) %>% 
  cbind(st_coordinates(st_centroid((.))))

st_crs(map_preds) <- st_crs(all_nhoods)

# plot errors on a map
ggplot() +
  geom_sf(data = all_nhoods_MB, fill = "grey40") +
  geom_sf(data = map_preds, aes(colour = q5(SalePrice.AbsError)),
          show.legend = "point", size = 1) +
  scale_colour_manual(values = palette5,
                      labels=qBr(map_preds,"SalePrice.AbsError"),
                      name="Quintile\nBreaks") +
  labs(title="Absolute sale price errors on the OOF set",
       subtitle = "OOF = 'Out Of Fold'") +
  mapTheme()

#Start Ch4
k_nearest_neighbors = 20
#prices
coords <- st_coordinates(st_centroid(MiamiProperties)) 
# k nearest neighbors
neighborList <- knn2nb(knearneigh(coords, k_nearest_neighbors))
spatialWeights <- nb2listw(neighborList, style="W")
MiamiProperties$lagPrice <- lag.listw(spatialWeights, MiamiProperties$SalePrice)


#errors
Miami.test <-
  Miami.test %>%
  mutate(Regression = "Baseline Regression",
         SalePrice.Predict = ifelse((predict(reg_final_train, Miami.test)) < 0, mean(Miami.training$SalePrice), predict(reg_final_train, Miami.test)),
         SalePrice.Error = SalePrice.Predict - SalePrice,
         SalePrice.AbsError = abs(SalePrice.Predict - SalePrice),
         SalePrice.APE = (abs(SalePrice.Predict - SalePrice)) / abs(SalePrice.Predict))%>%
  mutate(SalePrice.AbsError = replace_na(SalePrice.AbsError, 0))%>%
  mutate(SalePrice.Error=replace_na(SalePrice.Error, 0))

coords.test <-  st_coordinates(st_centroid(Miami.test)) 
neighborList.test <- knn2nb(knearneigh(coords.test, k_nearest_neighbors))
spatialWeights.test <- nb2listw(neighborList.test, style="W")
Miami.test$lagPriceError <- lag.listw(spatialWeights.test, Miami.test$SalePrice.AbsError)

summary(Miami.test$SalePrice.AbsError)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     1333   100462   240320   402650   487901 13444949
##lag of price

Miami.test %>% 
  mutate(lagPriceError = lag.listw(spatialWeights.test, SalePrice.AbsError))
## Simple feature collection with 1019 features and 88 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 880977.1 ymin: 500856.3 xmax: 944453.6 ymax: 559657.2
## projected CRS:  NAD_1983_StatePlane_Florida_East_FIPS_0901_Feet
## First 10 features:
##           Folio SalePrice saleYear         Property.Address Property.City
## 1  131320040710    355000     2018           3811 NW 12 TER         Miami
## 2  141290210170    393800     2020        4001 WOODRIDGE RD         Miami
## 3  131320041310    304500     2018             951 NW 39 CT         Miami
## 4  232270170570   2150000     2018        3738 PINE TREE DR   Miami Beach
## 5  131340140090    340000     2018            2051 NW 18 ST         Miami
## 6  131330050230    330000     2019            1554 NW 29 CT         Miami
## 7  131260390541    260000     2019            1253 NW 33 ST         Miami
## 8  141150240220   1700000     2018         1900 SECOFFEE ST         Miami
## 9  232220080020    940000     2019         4585 NAUTILUS DR   Miami Beach
## 10 232100121170    432500     2018 7115 TROUVILLE ESPLANADE   Miami Beach
##    Property.Zip County.Taxable City.Taxable AdjustedSqFt LotSize Bed Bath
## 1    33126-2618         290388       290388         1717    6780   2    1
## 2    33133-6617         932499       932499         1506    9380   2    2
## 3    33126-3621         192321       192321          983    8760   2    1
## 4    33140-3936        1748919      1748919         3798   17700   3    3
## 5    33125-1411         254935       254935         2056    8400   4    2
## 6    33125-2036         240596       240596         1654    9420   2    1
## 7    33142-6150         200414       200414         1336    6027   2    1
## 8    33133-3211        1420637      1420637         2572   20000   4    3
## 9    33140-2827         798332       798332         1827   10240   3    3
## 10   33141-3515         312603       312603         1662    6500   2    2
##    Stories Units EffectiveYearBuilt LivingSqFt ActualSqFt    X toPredict
## 1        1     1               1997       1548       2429 <NA>         0
## 2        1     1               2018       1583       1583 <NA>         0
## 3        1     1               1980        871       1182 <NA>         0
## 4        2     1               1995       3647       4286 <NA>         0
## 5        1     1               1975       1803       2535 <NA>         0
## 6        1     1               1949       1152       2664 <NA>         0
## 7        1     1               1930       1336       1336 <NA>         0
## 8        1     1               1950       2572       2572 <NA>         0
## 9        1     1               1960       1827       1827 <NA>         0
## 10       1     1               1947       1655       1675 <NA>         0
##       pool singlefamily estates    dock Age 0 luxury elevator Gazebo BeachView
## 1  No Pool          Yes      No No Dock  23 0     No       No     No        No
## 2  No Pool          Yes      No No Dock   2 0     No       No     No        No
## 3  No Pool          Yes      No No Dock  40 0     No       No     No        No
## 4     Pool           No     Yes No Dock  25 0     No       No     No        No
## 5  No Pool           No      No No Dock  45 0     No       No     No        No
## 6  No Pool           No      No No Dock  71 0     No       No     No        No
## 7  No Pool           No      No No Dock  90 0     No       No     No        No
## 8  No Pool          Yes      No No Dock  70 0     No       No     No        No
## 9     Pool          Yes      No No Dock  60 0     No       No     No        No
## 10 No Pool          Yes      No No Dock  73 0     No       No     No        No
##      CoastDist milecoast halfmilecoast bars_nn1 bars_nn2 bars_nn3 bars_nn4
## 1  2.962703488        No            No 1579.023 1728.056 1783.866 1818.521
## 2  0.338089977       Yes           Yes 7429.355 7431.387 7510.776 7562.259
## 3  2.968909961        No            No 1114.699 1170.437 1191.080 1212.170
## 4  0.008728162       Yes           Yes 1060.072 1208.241 1391.619 1690.378
## 5  1.508557718        No            No 5869.254 6483.301 6725.936 6886.482
## 6  2.190778631        No            No 3440.014 4040.681 4702.859 5148.298
## 7  1.290936092        No            No 4905.154 5330.119 5491.495 5605.220
## 8  0.162915663       Yes           Yes 2016.365 2701.140 3035.818 3205.692
## 9  0.122984722       Yes           Yes 3240.267 3798.052 4319.211 4600.420
## 10 0.254416641       Yes           Yes 1928.127 1928.497 1943.482 1952.119
##    bars_nn5 bars_Buffer sexualoffenders_Buffer crime_nn1 crime_nn2 crime_nn3
## 1  1842.905           0                      0 1370.7684  1568.139  1761.061
## 2  7606.466           0                      0 5278.8628  5596.144  5702.356
## 3  1226.014           0                      0 1652.0567  1946.055  2067.804
## 4  1870.726           0                      0 7094.2441  7377.017  7485.720
## 5  6983.101           0                      1  316.1319   863.202  1219.966
## 6  5548.213           0                      0 1406.6022  1406.602  1456.863
## 7  5682.011           0                      0  845.1109  1337.243  1508.471
## 8  3411.116           0                      0 2121.7092  2973.869  3265.542
## 9  5015.358           0                      0 6225.9584  6989.676  8095.403
## 10 2134.229           0                      0 1653.4457  3353.656  3930.982
##    crime_nn4 crime_nn5 parks_nn1 parks_nn2 parks_nn3 parks_nn4 parks_nn5
## 1   1896.454  2067.695 1876.0041  2029.165  2958.666  3837.817  4431.129
## 2   7042.636  7915.235  840.0082  1283.240  1441.323  1600.715  2264.571
## 3   2249.676  2516.068 2956.0521  3036.696  3314.580  4009.695  4617.866
## 4   7612.477  7779.277 1869.1895  2109.965  2217.044  2328.252  2406.506
## 5   1481.557  1643.815  849.8401  1329.159  1751.722  2070.070  2262.667
## 6   1543.299  1657.059 1625.3104  2359.249  2805.104  3127.282  3424.477
## 7   1605.037  1702.643 2704.4825  2949.092  3122.917  3379.478  3567.166
## 8   3631.172  4024.269 1564.4762  1910.319  2258.438  2922.009  3464.650
## 9   8651.327  8994.762 1962.5849  1966.289  1971.318  2368.780  2764.235
## 10  4299.127  4680.126  325.0006  1076.707  1646.523  1993.162  2216.839
##    parks.Buffer  metro_nn1     Halfmile_metro                  elem_name
## 1             0  4735.6939 Non_Halfmile_metro    Kinloch Park Elementary
## 2             0  6566.7689 Non_Halfmile_metro   Coconut Grove Elementary
## 3             0  5708.2272 Non_Halfmile_metro    Kinloch Park Elementary
## 4             0 21477.2030 Non_Halfmile_metro     North Beach Elementary
## 5             0  4695.1523 Non_Halfmile_metro        Comstock Elementary
## 6             0  6247.3365 Non_Halfmile_metro Kensington Park Elementary
## 7             0   575.2278     Halfmile_metro     Santa Clara Elementary
## 8             0  4291.2755 Non_Halfmile_metro   Coconut Grove Elementary
## 9             0 20948.7032 Non_Halfmile_metro     North Beach Elementary
## 10            1 28446.8243 Non_Halfmile_metro Treasure Island Elementary
##               Neighborhood MiamiBeach       GEOID TotalPop MedHHInc MedRent
## 1   West Grapeland Heights         No 12086490100     1881    38250     845
## 2              South Grove         No 12086007300     4037   191250    1900
## 3   West Grapeland Heights         No 12086490100     1881    38250     845
## 4              MIAMI BEACH        Yes 12086004000     5629    92181    1450
## 5              Curtis Park         No 12086003003     6339    25521     888
## 6  North Grapeland Heights         No 12086005001     7763    39704    1156
## 7              Santa Clara         No 12086002501     5880    27540     901
## 8               East Grove         No 12086006802     3373   127900       0
## 9              MIAMI BEACH        Yes 12086004000     5629    92181    1450
## 10             MIAMI BEACH        Yes 12086003915     5799    49662    1258
##    MedInc TotalHH OwnerHH RenterHH HHNoMort Own2017later Own201516 Own201014
## 1   20761    1866    1239      627      157            0         5        45
## 2   91949    3962    3416      546      465           33       138       304
## 3   20761    1866    1239      627      157            0         5        45
## 4   46659    5422    3637     1785      640           18        78       223
## 5   14949    6255     863     5392       98            0         0        30
## 6   16315    7630    4493     3137      559           36        91       218
## 7   17383    5759    1637     4122      192           32        12        78
## 8   74966    3308    2573      735      539           55       141       285
## 9   46659    5422    3637     1785      640           18        78       223
## 10  24668    5756    2333     3423      342           16       130       102
##    HH200k pctWhite   pctBlack   pctHis pctBlackorHis pctBachelors pctPoverty
## 1       8 91.86603  3.6682616 94.09888      97.76715    0.4253057  19.191919
## 2     733 88.33292  1.4862522 41.93708      43.42333    0.3467922  11.072579
## 3       8 91.86603  3.6682616 94.09888      97.76715    0.4253057  19.191919
## 4     438 91.33061  0.8527269 28.69071      29.54344    1.5988630   8.296323
## 5      16 86.65405  9.6229689 91.08692     100.70989    0.3312825  39.391071
## 6       0 93.48190  1.4427412 97.42368      98.86642    1.3525699  16.617287
## 7      23 61.30952 23.1802721 83.35034     106.53061    0.0000000  42.329932
## 8     507 96.70916  0.3854136 45.33057      45.71598    1.0672991   9.546398
## 9     438 91.33061  0.8527269 28.69071      29.54344    1.5988630   8.296323
## 10     86 74.58182  1.7589240 71.75375      73.51267    2.3624763  19.951716
##    pctCarCommute pctPubCommute pctOwnerHH pctRenterHH pctNoMortgage
## 1       88.09284     3.6326942   66.39871    33.60129      8.413719
## 2       79.76756     0.1584786   86.21908    13.78092     11.736497
## 3       88.09284     3.6326942   66.39871    33.60129      8.413719
## 4       61.01340     2.2194305   67.07857    32.92143     11.803762
## 5       80.27330    12.8241065   13.79696    86.20304      1.566747
## 6       92.67944     3.9748356   58.88598    41.11402      7.326343
## 7       78.38672    13.1995369   28.42507    71.57493      3.333912
## 8       79.54393     0.4024145   77.78114    22.21886     16.293833
## 9       61.01340     2.2194305   67.07857    32.92143     11.803762
## 10      76.03942     5.0816138   40.53162    59.46838      5.941626
##    pctOwnerSince2010 pctHH200kOrMore year                  geometry
## 1           4.035513       0.4287245 2018 POINT (899958.5 528010.9)
## 2          13.905152      18.5007572 2018 POINT (900622.5 501244.8)
## 3           4.035513       0.4287245 2018 POINT (899458.4 527050.3)
## 4           8.770965       8.0781999 2018 POINT (943195.3 538198.7)
## 5           3.476246       0.2557954 2018     POINT (909818 530781)
## 6           7.678611       0.0000000 2018   POINT (905278 529658.8)
## 7           7.452657       0.3993749 2018 POINT (913864.7 536521.7)
## 8          18.694131      15.3264813 2018 POINT (910913.1 512757.6)
## 9           8.770965       8.0781999 2018 POINT (940728.9 541686.6)
## 10         10.630090       1.4940931 2018 POINT (939845.8 553167.4)
##             Regression SalePrice.Predict SalePrice.Error SalePrice.AbsError
## 1  Baseline Regression          996882.7       641882.72          641882.72
## 2  Baseline Regression          710422.4       316622.38          316622.38
## 3  Baseline Regression          341087.3        36587.28           36587.28
## 4  Baseline Regression         6206694.5      4056694.49         4056694.49
## 5  Baseline Regression          657842.9       317842.93          317842.93
## 6  Baseline Regression         1220939.5       890939.53          890939.53
## 7  Baseline Regression          409873.8       149873.78          149873.78
## 8  Baseline Regression         1367543.3      -332456.70          332456.70
## 9  Baseline Regression          908499.1       -31500.89           31500.89
## 10 Baseline Regression          138788.3      -293711.74          293711.74
##    SalePrice.APE lagPriceError
## 1     0.64388991      218062.4
## 2     0.44568188      403176.0
## 3     0.10726662      255348.4
## 4     0.65359983      566141.7
## 5     0.48315930      236394.2
## 6     0.72971634      248975.2
## 7     0.36565837      270265.6
## 8     0.24310506      465061.7
## 9     0.03467355      580959.9
## 10    2.11625785      523939.3
ggplot(MiamiProperties, aes(x=lagPrice, y=SalePrice)) +
  geom_point(colour = "#FA7800") +
  geom_smooth(method = "lm", se = FALSE, colour = "#25CB10") +
  labs(title = "Price as a function of the spatial lag of price",
       caption = "Public Policy Analytics, Figure 6.6",
       x = "Spatial lag of price (Mean price of 5 nearest neighbors)",
       y = "Sale Price") +
  plotTheme()

ggplot(Miami.test, aes(x=lagPriceError, y=SalePrice)) +
  geom_point(colour = "#FA7800") +
  geom_smooth(method = "lm", se = FALSE, colour = "#25CB10") +
  labs(title = "Error as a function of the spatial lag of price",
       caption = "",
       x = "Spatial lag of errors (Mean error of 5 nearest neighbors)",
       y = "Sale Price") +
  plotTheme()

#Moran's I - A measure of spatial correlation
moranTest <- moran.mc(Miami.test$SalePrice.AbsError, 
                      spatialWeights.test, nsim = 999)

ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in orange",
       x="Moran's I",
       y="Count",
       caption="Miami") +
  plotTheme()

# this plot shows that there is spatial correlation 

## Provide a map of your predicted values for where ‘toPredict’ is both 0 and 1.

#Errors by group - Neighborhood, Elementary School
##Neighborhood
nhood_sum <- Miami.test %>% 
  group_by(Neighborhood) %>%
  summarize(meanPrice = mean(SalePrice, na.rm = T),
            meanPrediction = mean(SalePrice.Predict, na.rm = T),
            meanMAE = mean(SalePrice.AbsError, na.rm = T))

nhood_sum %>% 
  st_drop_geometry %>%
  arrange(desc(meanMAE)) %>% 
  kable() %>% kable_styling()
Neighborhood meanPrice meanPrediction meanMAE
San Marco Island 11695000.0 6963342.3 4731657.75
Bay Heights 1398000.0 2361979.1 963979.06
MIAMI BEACH 2886936.7 2679476.7 915020.01
South Grove Bayside 2390550.0 2883448.7 685183.49
Highland Park 850000.0 188341.7 661658.31
East Grove 1918371.4 2203690.6 637156.80
Belle Island 5182500.0 5650601.9 631155.06
Grove Center 2475000.0 2468669.8 513263.87
Little River Central 192600.0 701855.9 509255.93
Fair Isle 1974166.7 1879352.1 507144.71
West Grove 481400.0 885745.0 482841.31
Bird Grove East 655000.0 425994.3 480232.37
Northwestern Estates 147083.3 512498.1 431776.49
Edgewater 545000.0 152375.0 392624.96
Baypoint 2160000.0 2153078.5 386170.26
South Grove 1195517.6 1368252.9 379028.41
Bird Grove West 475000.0 651115.5 376337.81
Citrus Grove 308194.7 588761.9 375295.85
Morningside 1016852.6 1051312.3 367481.78
King Heights 223418.2 486216.1 364455.32
Flora Park 187231.6 453878.6 356821.75
Melrose 408671.4 595390.0 335058.81
Hadley Park 204852.9 410665.9 328153.04
North Grapeland Heights 290588.2 555891.3 328063.67
Brentwood 332500.0 659603.8 327103.77
Shenandoah South 458308.1 602663.8 318187.64
Belle Meade 959200.0 1035626.4 312414.03
Belle Meade West 391250.0 699228.0 307977.99
Shorecrest 547500.0 650718.4 293542.36
Curtis Park 284000.0 558372.5 293398.22
Flagami 339740.9 500982.3 289283.12
Orchard Villa 218727.3 415495.6 284769.38
Auburndale 316500.0 545627.9 274538.68
North Grove 1136470.6 1042797.3 269367.36
East Little Havana 424566.7 653691.7 266036.30
Parkdale South 382500.0 452603.7 264816.50
Liberty Square 146168.8 361483.7 260392.62
Edison 221136.0 419123.2 254546.34
North Sewell Park 411250.0 348701.8 249281.22
Douglas Park 350333.3 474881.6 248661.23
Roads 655037.1 672794.5 248431.51
Shenandoah North 476894.7 437459.1 246745.05
Santa Clara 247547.1 460572.9 238849.32
South Grapeland Heights 296812.5 479782.7 236458.26
La Pastorita 372500.0 140718.7 231781.31
Spring Garden 1460000.0 1237770.8 222229.15
Bayside 595307.7 756383.3 217689.54
Coral Gate 472100.0 342632.9 217514.25
Buena Vista West 320305.3 459212.7 207779.01
Silver Bluff 501716.2 522366.8 207169.64
Old San Juan 495175.0 514327.2 194075.24
West Grapeland Heights 344400.0 473866.9 182249.76
Lemon City/Little Haiti 271187.5 405862.0 172148.37
South Sewell Park 384142.9 286857.3 170530.07
Miami Avenue 1461450.0 1443793.8 158016.28
Parkdale North 420983.3 282265.7 138717.64
Latin Quarter 240000.0 104701.5 135298.49
Northeast Overtown 370000.0 501690.1 131690.13
Orange Bowl 250000.0 136304.5 113695.48
Allapattah Industrial District 220000.0 119212.2 100787.83
Historic Buena Vista East 600000.0 685393.3 85393.29
Buena Vista Heights 390171.4 420124.9 74480.80
Civic Center 245000.0 312966.4 67966.43
Palm Grove 740000.0 701222.1 38777.86
map_preds_sum <- map_preds %>% 
  group_by(Neighborhood) %>% 
  summarise(meanMAE = mean(SalePrice.AbsError))

ggplot() +
  geom_sf(data = all_nhoods %>% 
            left_join(st_drop_geometry(map_preds_sum), by = "Neighborhood"),
          aes(fill = q5(meanMAE))) +
  scale_fill_manual(values = palette5,
                    labels=qBr(nhood_sum,"meanMAE"),
                    name="Quintile\nBreaks") +
  mapTheme() +
  labs(title="Absolute sale price errors on the OOF set by Neighborhood")

reg.nhood <- lm(SalePrice ~ ., data = as.data.frame(Miami.training) %>% 
                  dplyr::select(Neighborhood, SalePrice, LotSize, Bed, Bath, Age, ActualSqFt, parks_nn1, MedHHInc, pctHis, pctBachelors, pool, 
                                singlefamily, pctOwnerHH, pctCarCommute, pctHH200kOrMore, Halfmile_metro, luxury, elevator, dock, sexualoffenders_Buffer, estates))



Miami.test.nhood <-
  Miami.test %>%
  mutate(Regression = "Neighborhood Effects",
         SalePrice.Predict = ifelse((predict(reg.nhood, Miami.test)) < 0, mean(Miami.training$SalePrice), predict(reg.nhood, Miami.test)),
         SalePrice.Error = SalePrice - SalePrice.Predict,
         SalePrice.AbsError = abs(SalePrice - SalePrice.Predict),
         SalePrice.APE = (abs(SalePrice - SalePrice.Predict)) / abs(SalePrice)) %>%
  mutate(SalePrice.AbsError = replace_na(SalePrice.AbsError, 0))%>%
  mutate(SalePrice.Error=replace_na(SalePrice.Error, 0))

bothRegressions <- 
  rbind(
    dplyr::select(Miami.test, starts_with("SalePrice"), Regression, Neighborhood) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.test, SalePrice.Error)),
    dplyr::select(Miami.test.nhood, starts_with("SalePrice"), Regression, Neighborhood) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.test, SalePrice.Error)))    


st_drop_geometry(bothRegressions) %>%
  gather(Variable, Value, -Regression, -Neighborhood) %>%
  filter(Variable == "SalePrice.AbsError" | Variable == "SalePrice.APE") %>%
  group_by(Regression, Variable) %>%
  summarize(meanValue = mean(Value, na.rm = T)) %>%
  spread(Variable, meanValue) %>%
  kable() %>%
  kable_styling("striped", full_width = F) %>%
  row_spec(1, color = "black", background = "#25CB10") %>%
  row_spec(2, color = "black", background = "#FA7800")
Regression SalePrice.AbsError SalePrice.APE
Baseline Regression 402650.0 1.433111
Neighborhood Effects 401716.9 1.018834
bothRegressions %>%
  dplyr::select(SalePrice.Predict, SalePrice, Regression) %>%
  ggplot(aes(SalePrice, SalePrice.Predict)) +
  geom_point() +
  stat_smooth(aes(SalePrice, SalePrice), 
              method = "lm", se = FALSE, size = 1, colour="#FA7800") + 
  stat_smooth(aes(SalePrice.Predict, SalePrice), 
              method = "lm", se = FALSE, size = 1, colour="#25CB10") +
  facet_wrap(~Regression) +
  labs(title="Predicted sale price as a function of observed price",
       subtitle="Orange line represents a perfect prediction; Green line represents prediction") +
  plotTheme() + theme(plot.title = element_text(size = 18, colour = "black")) 

#map MAPE by neighborhoods
st_drop_geometry(bothRegressions) %>%
  group_by(Regression, Neighborhood) %>%
  summarize(mean.MAPE = mean(SalePrice.APE, na.rm = T)) %>%
  ungroup() %>% 
  left_join(all_nhoods) %>%
  st_sf() %>%
  ggplot() + 
  geom_sf(aes(fill = mean.MAPE)) +
  geom_sf(data = bothRegressions, colour = "black", size = .5) +
  facet_wrap(~Regression) +
  scale_fill_gradient(low = palette5[1], high = palette5[5],
                      name = "MAPE") +
  labs(title = "Mean test set MAPE by neighborhood") +
  mapTheme()

#Race Context and Income Context (from book)
tracts18 <- 
  tracts18%>%
  mutate(raceContext = ifelse(pctWhite > .75, "Majority White", "Majority Non-White"),
         incomeContext = ifelse(MedInc > 32322, "High Income", "Low Income"))

grid.arrange(ncol = 2,
             ggplot() + geom_sf(data = na.omit(tracts18), aes(fill = raceContext)) +
               scale_fill_manual(values = c("#25CB10", "#FA7800"), name="Race Context") +
               labs(title = "Race Context") +
               mapTheme() + theme(legend.position="bottom"), 
             ggplot() + geom_sf(data = na.omit(tracts18), aes(fill = incomeContext)) +
               scale_fill_manual(values = c("#25CB10", "#FA7800"), name="Income Context") +
               labs(title = "Income Context") +
               mapTheme() + theme(legend.position="bottom"))

st_join(st_centroid(bothRegressions), tracts18) %>% 
  group_by(Regression, raceContext) %>%
  summarize(mean.MAPE = scales::percent(mean(SalePrice.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(raceContext, mean.MAPE) %>%
  kable(caption = "Test set MAPE by neighborhood racial context") %>%
  kable_styling("striped", full_width = F)
Test set MAPE by neighborhood racial context
Regression Majority White
Baseline Regression 143%
Neighborhood Effects 102%
st_join(st_centroid(bothRegressions), tracts18) %>% 
  filter(!is.na(incomeContext)) %>%
  group_by(Regression, incomeContext) %>%
  summarize(mean.MAPE = scales::percent(mean(SalePrice.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(incomeContext, mean.MAPE) %>%
  kable(caption = "Test set MAPE by neighborhood income context") %>%
  kable_styling("striped", full_width = F)
Test set MAPE by neighborhood income context
Regression High Income Low Income
Baseline Regression 44% 180%
Neighborhood Effects 35% 126%
##Elementary School
elem_sum <- Miami.test %>% 
  group_by(elem_name) %>%
  summarize(meanPrice = mean(SalePrice, na.rm = T),
            meanPrediction = mean(SalePrice.Predict, na.rm = T),
            meanMAE = mean(SalePrice.AbsError, na.rm = T))

elem_sum %>% 
  st_drop_geometry %>%
  arrange(desc(meanMAE)) %>% 
  kable() %>% kable_styling()
elem_name meanPrice meanPrediction meanMAE
South Pointe Elementary 10262857.1 6064669.6 4288048.6
North Beach Elementary 3201125.0 3100833.2 948887.1
Carver, George W. Elementary 337380.0 791175.2 642186.3
Treasure Island Elementary 1346104.8 1457182.6 537725.4
Coconut Grove Elementary 1436635.2 1622857.7 462832.9
Flagami Elementary 336687.5 607747.9 432380.2
Pharr, Kelsey L. Elementary 268777.8 649581.5 410085.7
Holmes Elementary 145750.0 470684.7 389572.7
Biscayne Elementary 1486871.4 1516256.4 376600.2
Liberty City Elementary 136400.0 501354.2 371631.5
Olinda Elementary 181742.9 431463.3 360742.2
McCrary, Jr. Jesse J. Elementary 396714.3 650585.1 360201.6
Dunbar, Paul L. Elementary 572960.0 548432.6 329027.6
Morningside K-8 Center 1132992.8 1232758.2 324238.7
Citrus Grove Elementary 326012.0 455099.2 323790.4
Orchard Villa Elementary 223422.2 425251.9 317732.1
Flagler, Henry M. Elementary 320257.6 447547.1 311170.5
Smith, Lenora Braynon. Elementary 208989.4 386916.7 310679.3
Miller, Phyllis R. Elementary 568531.2 666815.5 302626.0
Kensington Park Elementary 295544.1 520664.9 293127.6
Tucker, Frances S. Elementary 363310.5 530961.7 282689.9
Fairlawn Elementary 337500.0 519064.7 280641.2
Fienberg/Fisher K-8 Center 1015833.3 1051658.4 279053.4
Comstock Elementary 275000.0 509934.4 277320.7
Shenandoah Elementary 445589.7 501830.8 263744.4
Wheatley, Phillis Elementary 457500.0 327032.6 262157.5
Angelou, Maya Elementary 280116.7 451369.5 261926.7
Coral Gables Preparatory Academy 442769.2 405544.2 259532.2
Coral Way K-8 Center 613216.2 649422.5 255870.7
L’ouverture, Toussaint Elementary 303672.0 505200.8 253107.1
Auburndale Elementary 371518.2 517834.2 248927.9
Riverside Elementary 398200.0 575264.6 247359.5
Silver Bluff Elementary 488725.5 495055.1 224921.0
Kinloch Park Elementary 355796.8 452824.4 216684.9
Santa Clara Elementary 239350.0 402655.0 189138.3
Edison Park K-8 Center 199585.7 385114.6 185528.9
Hartner, Eneida M. Elementary 455728.6 527093.3 181258.4
Shadowlawn Elementary 297396.3 385640.2 158815.6
Melrose Elementary 327000.0 173527.0 153473.0
Drew, Charles R. K-8 Center 152050.0 185972.7 114101.0
map_preds_sum_elem <- map_preds %>% 
  group_by(elem_name) %>% 
  summarise(meanMAE = mean(SalePrice.AbsError))

ggplot() +
  geom_sf(data = elem_map %>% 
            left_join(st_drop_geometry(map_preds_sum_elem), by = "elem_name"),
          aes(fill = q5(meanMAE))) +
  scale_fill_manual(values = palette5,
                    labels=qBr(nhood_sum,"meanMAE"),
                    name="Quintile\nBreaks") +
  mapTheme() +
  labs(title="Absolute sale price errors on the OOF set by Elementary School District")

reg.elem <- lm(SalePrice ~ ., data = as.data.frame(Miami.training) %>% 
                  dplyr::select(elem_name, SalePrice, LotSize, Bed, Bath, Age, ActualSqFt, parks_nn1, MedHHInc, pctHis, pctBachelors, pool, 
                                singlefamily, pctOwnerHH, pctCarCommute, pctHH200kOrMore, Halfmile_metro, luxury, elevator, dock, sexualoffenders_Buffer, estates))

Miami.test.elem <-
  Miami.test %>%
  mutate(Regression = "Elementary School Effects",
         SalePrice.Predict = predict(reg.elem, Miami.test),
         SalePrice.Error = SalePrice - SalePrice.Predict,
         SalePrice.AbsError = abs(SalePrice - SalePrice.Predict),
         SalePrice.APE = (abs(SalePrice - SalePrice.Predict)) / abs(SalePrice)) %>%
  mutate(SalePrice.AbsError = replace_na(SalePrice.AbsError, 0))%>%
  mutate(SalePrice.Error=replace_na(SalePrice.Error, 0))

threeRegressions <- 
  rbind(
    dplyr::select(Miami.test, starts_with("SalePrice"), Regression, Neighborhood, elem_name) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.test, SalePrice.Error)),
    dplyr::select(Miami.test.nhood, starts_with("SalePrice"), Regression, Neighborhood, elem_name) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.test, SalePrice.Error)),
    dplyr::select(Miami.test.elem, starts_with("SalePrice"), Regression, Neighborhood, elem_name) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.test, SalePrice.Error))) 


st_drop_geometry(threeRegressions) %>%
  gather(Variable, Value, -Regression, -Neighborhood,-elem_name) %>%
  filter(Variable == "SalePrice.AbsError" | Variable == "SalePrice.APE") %>%
  group_by(Regression, Variable) %>%
  summarize(meanValue = mean(Value, na.rm = T)) %>%
  spread(Variable, meanValue) %>%
  kable() %>%
  kable_styling("striped", full_width = F) %>%
  row_spec(1, color = "black", background = "#25CB10") %>%
  row_spec(2, color = "black", background = "#FA7800") %>%
  row_spec(3, color="black", background = "#0082FA")
Regression SalePrice.AbsError SalePrice.APE
Baseline Regression 402650.0 1.4331108
Elementary School Effects 383564.0 0.7952907
Neighborhood Effects 401716.9 1.0188337
threeRegressions %>%
  dplyr::select(SalePrice.Predict, SalePrice, Regression) %>%
  ggplot(aes(SalePrice, SalePrice.Predict)) +
  geom_point() +
  stat_smooth(aes(SalePrice, SalePrice), 
              method = "lm", se = FALSE, size = 1, colour="#FA7800") + 
  stat_smooth(aes(SalePrice.Predict, SalePrice), 
              method = "lm", se = FALSE, size = 1, colour="#25CB10") +
  facet_wrap(~Regression) +
  labs(title="Predicted sale price as a function of observed price",
       subtitle="Orange line represents a perfect prediction; Green line represents prediction") +
  plotTheme() + theme(plot.title = element_text(size = 18, colour = "black")) 

#map MAPE by neighborhoods (need to figure out how to add the elementary school basemap in here)
st_drop_geometry(threeRegressions) %>%
  group_by(Regression, Neighborhood, elem_name) %>%
  summarize(mean.MAPE = mean(SalePrice.APE, na.rm = T)) %>%
  ungroup() %>% 
  left_join(all_nhoods) %>%
  st_sf() %>%
  ggplot() + 
  geom_sf(aes(fill = mean.MAPE)) +
  geom_sf(data = bothRegressions, colour = "black", size = .5) +
  facet_wrap(~Regression) +
  scale_fill_gradient(low = palette5[1], high = palette5[5],
                      name = "MAPE") +
  labs(title = "Mean test set MAPE by neighborhood") +
  mapTheme()

#Race Context and Income Context (from book)
st_join(st_centroid(threeRegressions), tracts18) %>% 
  group_by(Regression, raceContext) %>%
  summarize(mean.MAPE = scales::percent(mean(SalePrice.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(raceContext, mean.MAPE) %>%
  kable(caption = "Test set MAPE by neighborhood racial context")
Test set MAPE by neighborhood racial context
Regression Majority White
Baseline Regression 143%
Elementary School Effects 80%
Neighborhood Effects 102%
st_join(st_centroid(threeRegressions), tracts18) %>% 
  filter(!is.na(incomeContext)) %>%
  group_by(Regression, incomeContext) %>%
  summarize(mean.MAPE = scales::percent(mean(SalePrice.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(incomeContext, mean.MAPE) %>%
  kable(caption = "Test set MAPE by neighborhood income context")
Test set MAPE by neighborhood income context
Regression High Income Low Income
Baseline Regression 44% 180%
Elementary School Effects 44% 93%
Neighborhood Effects 35% 126%
##### PREDICTION #####
secret_data <- MiamiPropertiesPred
secret_preds <- predict(reg5, newdata = secret_data)
output_preds <- data.frame(prediction = secret_preds, Folio = secret_data$Folio, team_name = "Miami Heat Predictors")

5. Discussion

<<<<<<< HEAD # 6. Predicitons

======= # 6. Conclusion

>>>>>>> 30a48337056764284380a80a0436d5403662727e

  1. Zillow. 2020. “Zestimate.” https://www.zillow.com/zestimate/↩︎